Difficult to understand results


You see an animated picture of the system
you are studying: any application


Announcing new 

PC AT SIMSCRIPT II.5 with SIMANIMATION 
Your simulation results are now easy-to-understand 


With PC SIMSCRIPT II.5 and the new simulation animation, your results are easy to understand: an animated picture, histograms, pie charts and plots.

Because your results are understood, your recommendations are more likely to be acted upon.

You can save your organization 
lots of money and further your 
career. 

Simulation animation simplified 

SIMSCRIPT II.5 gives you a 
compact English-like language. 
Your simulation program reads like 
a description of the system you are 
studying. Animation is easy. 

Your model development, checkout, modification and enhancement are greatly simplified.

Many successful applications 
SIMSCRIPT II.5® is a well 
established, standardized, and 
widely used language with proven 
software support. 

Typical applications include: 
military planning, manufacturing, 
communications, logistics, and 
transportation. 


Free trial and training 

The free trial contains everything you need to try PC SIMSCRIPT II.5 on your computer.

We send you PC SIMSCRIPT II.5, installation instructions, sample models, and a complete set of documentation.

You can build your model or modify one of ours. If you have questions, just call us for an immediate response.

No cost, no obligation. 

Act now—limited offer 

For a limited time we also include free training. Typical applications are explained and demonstrated.

Call today to avoid disappointment.

For immediate information 

Call Rick Crawford at (619) 
457-9681. In the UK, call Steve 
Wombell on (01) 940-3606. 

With PC SIMSCRIPT II.5 you
get results sooner and they are 
better understood. 
Guest Editors’ Introduction:

Systolic Arrays—From Concept to Implementation 

Jose A.B. Fortes and Benjamin W. Wah 

Systolic arrays have regular and modular structures that match the computational requirements of many algorithms. Their 
implementation requires that a wealth of subsumed concepts and engineering solutions be mastered and understood. 




Wavefront Array Processors— Concept to Implementation 

S. Y. Kung, S.C. Lo, S.N. Jean, and J.N. Hwang 

Most signal and image processing algorithms can be decomposed into computational wavefronts that 
pipelined arrays. 
The Saxpy Matrix-1: A General-Purpose Systolic Computer

David E. Foulser and Robert Schreiber 

The Matrix-1 employs a programmable and reconfigurable systolic array that achieves nearly gigaflop performance for problems in 
signal processing and matrix computation. 

SLAPP: A Systolic Linear Algebra Parallel Processor 

Barry L. Drake, Franklin T. Luk, Jeffrey M. Speiser, and Jerome J. Symanski
Recent signal-processing algorithm developments have stressed direct methods that operate on the data matrix via orthogonal matrix 
decompositions. Systolic arrays appear to be quite well matched to the requirement for real-time computation of these algorithms. 

Some Systolic Array Developments in the United Kingdom 

John V. McCanny and John G. McWhirter

Two major UK systolic array projects are described. The first concerns development of a wavefront array processor for adaptive
beamforming, the second the design of novel high-performance signal-processing chips. 

Fault Tolerance Techniques for Systolic Arrays 

Jacob A. Abraham, Prithviraj Banerjee, Chien-Yi Chen, W. Kent Fuchs, Sy-Yen Kuo, and A.L. Narasimha Reddy 
This article describes various techniques for fault tolerance that can be applied to systolic array architectures. The approach of 
algorithm-based fault tolerance is shown to be the natural one for such systems. 

Partitioning: An Essential Step in Mapping Algorithms Into 
Systolic Array Processors 

Juan J. Navarro, Jose M. Llaberia, and Mateo Valero 

The efficient solution of a large problem on a small systolic array requires good partitioning techniques to split the problem into 
subproblems that fit the array size. 

Systolic Arrays: A Survey of Seven Projects 

This special section provides a concise overview of seven projects concerned with the design and implementation of systolic arrays. 


















DEPARTMENTS

President’s Message
Letters to the Editor
Open Channel
Standards
Update
Computer Society News
Conferences
Call for Papers
Calendar
New Products
Microsystem Announcements
IC Announcements
Computer Society Officers and Information
Roster of Computer Society Committees
Book Reviews
New Literature



On the cover 

Guest editors Jose Fortes and Benjamin 
Wah point out that systolic arrays are so 
called because analogies can be drawn 
between their architecture and that of the 
human circulatory system. The human 
heart sends and receives a large amount 
of blood as a result of the frequent and 
rhythmic pumping of small amounts of 
that fluid through the arteries and veins. 
In this analogy, the heart corresponds to
the source and destination of data (such
as a global memory), and the network of
veins is equivalent to the array of processors
and links. Another analogy is that
in many of the first proposed systolic
architectures, processing elements alternated
between cycles of “admission”
and “expulsion” of data, much in the
same way that the heart behaves with
respect to the pumping of blood.
Cover image: Benn Mitchell © Image 
“Personal Consultant™ Plus... offers
a very fine expert system development 
and delivery tool that already has 
a proven record with end-users*” 

— Susan Shepard, AI Expert 


Personal Consultant Plus 3.0 Standard Features 

- Frames, rules, meta rules and procedures 

- Forward/backward chaining 

- Confidence factors 

- Regression testing and rule tracing 

- End-user explanation facilities 

- Graphics image capture and display 

- Interfaces to dBase™, Lotus 1-2-3™, DOS files,
.EXE or .COM programs, 'C'

- Complete LISP development environment 

- 2-megabyte expanded/extended memory support 

- Context sensitive help 

- 'Getting Started' tutorial-style manual 
what serious expert
Power tools. 



Among all the expert system development tools available for personal computers today, none deliver the power and flexibility of TI’s Personal Consultant series.

Personal Consultant Easy is ideal for getting started, and is upwardly compatible with the higher functionality of PC Plus. For experienced developers, Personal Consultant Plus and its optional add-on enhancements, Online and Images, were designed to help solve a broader range of complex problems.

Personal Consultant Plus. Full power for an affordable price.

At $2,950, PC Plus has proven to be one of the richest and most flexible problem-solving tools available for the development of complex knowledge-based systems. Designed to take advantage of today’s more powerful 286/386 DOS-based computers, or TI’s Explorer™ Symbolic Processing System, the new 3.0 version of PC Plus provides powerful standard features and a continuing growth path with the addition of either PC Images or PC Online, or both.

Personal Consultant Images. Picture an expert system with interactive graphics.

At $495, PC Images enables developers to create knowledge-based applications that incorporate complex graphical “active images.” User-interactive dials, gauges, forms and selection images provide a more exciting visual data input and output style.

Personal Consultant Online. The expert system as part of the process.

At $995, PC Online allows the developer to design expert systems which interact directly with process data, as opposed to input from a human operator. Designed for intelligent process monitoring applications, this optional package helps deliver expertise that is “online all the time.”

Application delivery as flexible as the tools themselves.

Delivery can be in LISP for flexibility, or “C”* for maximum speed and portability. Our “C” options support either stand-alone or “embedded” knowledge bases. Options are available for DOS-based PCs, TI’s Explorer, and DEC’s VAX™ line of multi-user minis running under VMS™.


“Texas Instruments has done more 
than any other company to educate 
people about AI, to popularize it, and 
to make useful AI tools available at 
reasonable prices. ” 

—Jim Seymour, PC Magazine. 


Technical support, training courses and 
Knowledge Engineering Services are 
available for the Personal Consultant 
products. If you have a question about 
any of our expert system power tools, we 
have the answer. 


Pick up the phone and gain a powerful 
advantage. 

Call 1-800-527-3500 for technical 
overviews of our products and a PC Plus 
case histories brochure which details 
how our power tools are being put to 
work today. 
Texas Instruments Incorporated.
Computer Society chapters offer
members a way of keeping up
to date in the computing field
via local activities. In this month’s
column, Willis King, vice president for
area activities, describes our chapter
activities program.

Roy L. Russo, president 

Chapter activities 

Individual reasons for joining a professional organization differ somewhat, but most of us join the Computer Society to keep up with the new knowledge created daily in our field and to exchange technical ideas with our peers. The Area Activities Board (AAB) is chartered to serve our members locally. By supporting the viability of the local chapters in general, by helping to enhance their technical programs, and by assisting in the establishment of new chapters when the situation arises, the AAB provides all members a convenient and low-cost means for continuing education and for fraternization with fellow members.

A chapter is a microcosm of the Computer Society itself. Each can, through its elected officers, hold technical meetings and conferences, conduct tutorials and workshops, and publish newsletters and other technical communications. Chapters are the grass roots organizations from which the Computer Society draws its future leaders. We urge you to exercise your membership privileges by participating in chapter activities. Through your local chapter you can easily make your views and wishes known. There are over 200 regular and student chapters located all over the world. No matter where you are, you should not be far from a chapter.



Area chairs

Region 1, Northeastern US
Keith Barker
Computer Science and Engineering
University of Connecticut
U155, 260 Glenbrook Rd.
Storrs, CT 06268
(203) 486-2566

Region 2, Eastern US
Harry K. Frost
Performance Associate Corp.
5 Bayard Rd.
Pittsburgh, PA 15213
(412) 561-1280

Region 3, Southeastern US
Sajjan G. Shiva
Computer Science Dept.
University of Alabama
Huntsville, AL 35899
(205) 895-6160

Region 4, Central US
Alicja I. Ellis
2747 Yosemite Ave. S.
St. Louis Park, MN 55416
(612) 541-2063

Region 5, Southwestern US
Arthur Altman
TI Computer Science Center
MS 238, PO Box 226015
Dallas, TX 75226
(214) 995-0383

Region 6, Pacific US
Dennis Reinhardt
DAIR Computer Systems
3440 Kenneth Dr.
Palo Alto, CA 94303
(415) 494-7081

Region 7, Canada
Micha Avni
Complexe Guy Favreau
200 Dorchester W.
West Tower, Suite 601
Montreal, Quebec H2Z 1X4
Canada
(514) 283-0004

Region 8, Europe, Middle East, and Africa
Roland J. Saam
Micros for Managers
149 Gloucester Rd.
London SW7 4TH
England
44 (1) 370-5125

Region 9, Latin America
Carlos Fronterotta
Avenue Sao Gabriel, 555
Suite 407
01435 Sao Paulo, SP
Brazil
55 (11) 284-3861

Region 10, Asia
N.V. Balasubramanian
Argyle Centre, Tower II
700 Nathan Rd.
Mongkok, Kowloon
Hong Kong
852 (3) 984321
This year, we are very fortunate to have a group of most capable and energetic volunteers helping us conduct AAB business. Your area chair is the leader in your region with the knowledge and authority to assist you in conducting your chapter activities. If you have any questions in running your local chapter, or if you need additional resources to launch a new activity, please feel free to contact your area chair. (For a list of area chairs and their addresses, see box.)

The Distinguished Visitor’s Program is intended to enrich the technical programs of local chapter meetings at practically no cost to the chapters. Every year, between 30 and 40 prominent scientists and engineers are asked to serve on this program. Subject to the approval of the program chair and the consent of the visitor, a chapter can arrange to have a distinguished visitor speak at its local meeting. The distinguished visitor donates his service, and the intercity travel expenses are paid by the Computer Society. For further information on this program, please contact Tse-yun Feng, Department of Electrical Engineering, Penn State University, University Park, PA 16802; (814) 863-1469.

The Chapter Tutorial Program offers 
one-day tutorials on a variety of 
introductory and advanced subjects at a 
very low cost to members. For further 
information, please contact Alicja Ellis, 
2747 Yosemite Ave. S., St. Louis Park, 
MN 55416; (612) 541-2063. 

Student members are particularly dear to us. They represent our future. We have programs tailored to their needs. Our Richard Merwin Scholarships are awarded annually to student chapter leaders who are also excellent scholars. The MicroMouse contest is designed to provide entertainment as well as intellectual challenge to our student members. Student officers and their faculty advisors who need further information and help may contact Murali Varanasi, Department of Computer Science and Engineering, University of South Florida, Tampa, FL 33620; (813) 974-3033.

The Computer Society is your society. We welcome suggestions that help us serve you better, and we particularly welcome offers to help with our many programs. I am looking forward to hearing from you.

Willis K. King 

Vice-President for Area Activities 
Department of Computer Science 
University of Houston 
Houston, TX 77004 
(713) 749-4791 
“Silver Bullet” on target

To the editor: 

I thoroughly enjoyed reading F.P. Brooks’ “No Silver Bullet” (Computer, April 1987) in which he predicts that there will be no significant improvements in software engineering methodology. It reminded me of the US Patent Office director’s classic turn-of-the-century prescience that there would be no further inventions of any value, since everything of any possible significance had already been invented. Brooks, himself a major contributor to software engineering, now in essence predicts there will be no followers.

While I claim none of Brooks’ impressive credentials, I have been working with a high-level language somewhat different from the sort he considers. Rather than creating a “tool-mastery burden that increases [with the degree to which it] furnish[es] all the constructs that the programmer imagines in the abstract program,” the S/R language [1] has a simple Pascal-like syntax which serves as a data-flow language constructor. With the same S/R language, special-purpose languages can be created “on the fly” for such diverse purposes as to create software to implement a communication protocol [2], create a software prototype for the implementation of integrated hardware [3], create a model for analysis of bond market transactions, strategy analysis, implement a Petri net description language, or simply implement algorithms for ordinary numerical computations, to name a few applications I have tried. In each such application a few minutes (or hours) of preparation is spent first to create the appropriate special-purpose data structures; in effect, to create a new special-purpose language. Then the required programs, algorithms or architectures are created using these specially tailored structures.


Computer welcomes your letters. Send them 
to Letters Editor, Computer, 10662 Los 
Vaqueros Cir., Los Alamitos, CA 90720. All 
submissions are subject to editing for style, 
space, and clarity considerations. 


Moreover (and probably, first and foremost), the S/R language has a semantics which facilitates formal symbolic analysis of its programs. Specifically, a software system called COSPAN implements algorithms used to test S/R programs for any omega-regular property. The problem of computational intractability normally associated with such tests is ameliorated (in “most” cases) by algorithms in COSPAN which perform formal reductions in conjunction with the symbolic analysis. S/R programs, thus analytically debugged, compile into serial or parallel C-code programs (at the user’s direction). Currently, a project is under way for silicon compilation of S/R programs.

While all this is quite new and the jury is still out on whether this will be a “silver bullet,” first indications suggest possibly as much as two orders of magnitude increase in the number of lines of debugged C-code which can be produced per programmer-hour. At any rate, it suggests that there still may be room for significant breakthroughs in software engineering.

R. P. Kurshan 

AT&T Bell Labs 

Murray Hill, N.J. 

1. J. Katzenelson and R.P. Kurshan, “S/R Language . . . ,” Proc. Fifth Ann. Int’l Phoenix Conf. on Computers and Communications, Computer Society, Los Alamitos, Calif., 1986, pp. 286-292.

2. R.P. Kurshan, “Proposed Specification of the BX.25 Link Layer Protocol,” Bell Sys. Tech. J., Vol. 64, 1985, pp. 559-596.

3. I. Gertner and R.P. Kurshan, “Logical Analysis of Digital Circuits,” Proc. Eighth Int’l Conf. Computer Hardware Description Languages, IFIP, Amsterdam, 1987.


To the editor: 

As a software practitioner (and especially after attending the Monterey ICSE [International Conference on Software Engineering]), I find Fred Brooks’ comments on the state of the software arts refreshing (Computer, April 1987, p. 10). If only the computer science establishment (well represented at the conference) also listened to the voices from the flock once in a while...

In another regard, however, I believe that Brooks’ comments are really off the mark, namely in his conclusion that software is incomparably more difficult to master than hardware. To paraphrase a famous comedian: “I’ve done hardware, and I’ve done software. Believe me, software is better!” When designing hardware, engineers must fight physics. In software, they only fight their own limitations. Put differently, in software, everything is possible, and therein lies the rub.

I have noticed recently that among engineers a new view of software has begun to emerge: that it can be controlled by imposing on it some of the constraints that nature imposes on hardware. If we regard data as signals and control as logic gates, we can create “software ICs.” Provided the “pins” are labelled correctly and proper design rule checks are performed on the “net list,” software components can then be interconnected as reliably as hardware components.

Thanks to CAE technology, such checks are standard practice in hardware. Unlike Brooks, I believe that even in software they are no longer just desirable, but eminently practical....

I suspect the main reason for software’s apparent intractability lies not so much in its complexity, but in the way it is taught. Few computer science curricula include courses on the principles of design. Yet, CS graduates should know at least as much about design as EEs know about programming....

It became obvious in the plenary sessions of ICSE that the CS mainstay, mathematical formalism, only applies to a few percent of the total software engineering effort....

Worse yet, most software errors today 
(not counting faulty specifications) can 
probably be traced to the inability of 
programmers to deal with present levels 
of abstraction.... 

Even in the academically biased atmosphere of ICSE, the one comment from the audience that drew applause in a panel session was this: Circulate more software people between academic and industrial environments. This approach has worked wonders in the hardware world, but software folk on both sides seem to shy away from it. I suspect mathematicians don’t care about industry, and vice versa.

Max J. Schindler 

Prime Technology 


To the editor: 

While I enjoyed Fred Brooks’ article
in the April issue, “No Silver Bullet,” I 
also feel he has helped reinforce and 
perpetuate more software development 
folklore than he eliminated. 

According to Brooks, the key to successful software development is that it is carried out by one software engineer (or very few). Unfortunately, real-world software development doesn’t follow a story-book plot. Today’s complex software has grown beyond the capacity of an individual, or even a small group, to handle.

I noticed on Brooks’ bio that he was involved in the development of the 360. That, in itself, is an excellent example because the operating system and applications software weren’t developed by a single group or even by IBM itself....

Today, we have a different situation. Software engineers coming out of school have absolutely no idea what is going on in the operating system because they aren’t trained in (nor allowed to tinker with) the internals. But it’s OK because, in today’s computerized organization, no one can be expected to have the expert knowledge or time needed to manage, analyze, design, code, test, document, and maintain an extensive and detailed software system. For this reason, it is safe to say that software development is a team effort.

As with all group activities, the team developing a new software package must be coordinated to be efficient and effective. Yourdon and others have been preaching (and proving) that to be truly successful, software developers have to get away from perpetuating the myth that software development is an art. Instead, they have to take the approach that it is a (software) engineering and manufacturing process.

Once this is done, productivity will 
increase dramatically. 

...Just as software itself is a productivity tool, software engineers need to rely on productivity tools to do their jobs. But individual tools fall short of fulfilling the market’s needs.

A better solution is a total software 
engineering environment (SEE) that 
takes the group dynamics of managerial 
and technical relationships into account. 
This approach links, coordinates, and 
manages the activity of the group. It 
relays information and acts as the true 
development environment for the 
deliverable product.... 

G.A. Marken 
Marken Communications 


Brooks replies... 

Max Schindler and G.A. Marken are each pushing their own “silver bullets.” I obviously advocate and support disciplined engineering practice and SEEs. Neither of their proposals is, however, accompanied by analyses as to how they can even conceivably give order-of-magnitude improvements. My central argument is that so much of the software task is now essence (the fashioning of the conceptual constructs themselves) that any truly promising attack must aim at that part. So, as to that central argument, neither nostrum upsets my rostrum.

R.P. Kurshan’s letter, on the other hand, has both preliminary data and a theoretical argument as to why the S/R language addresses the software problem at the concept-constructing, or essence, level. It would seem from his letter to be very similar to object-oriented programming, and it simplifies concept construction in much the same way. Perhaps that bullet is indeed silver; I hope so.

Concerning Schindler’s “Believe me, 
software is better!”, I can only say that 
we have had different experiences. 

Since entering the computer field in 1952, my efforts, time, and concerns have been about equally divided between hardware and software. (In 1961-65, for example, I was project manager for System/360 hardware from inception through first customer shipment, and project manager of Operating System/360 and all its compilers and utilities, from the first design of that project through Alpha Test.)

I quite agree with Schindler that software is in principle easier to do, for exactly the reasons he describes (see my The Mythical Man-Month, Chapter 1). In practice, however, I have found software much more difficult to manage and to construct on large scale. This is due, I think, in part to the immaturity of the discipline and in part to the difficulties of doing quantitative assessment and measurement on conceptual constructs that are so nearly disembodied.
Frederick P. Brooks, Jr. 

...and adds his own 
comments 

Re: “No Silver Bullet — Essence and 
Accidents of Software Engineering,” 
reprinted in your April issue. 

To give credit as due, the illustrations and the explanatory sidebar, “To Slay the Werewolf” (p. 13), were the work of the Computer Society editors, not part of the paper as delivered at IFIP 86. I was not aware of them before publication.

I should like to acknowledge many 
rich discussions with my colleagues on 
the Defense Science Board Task Force 
on Military Software. The views in the 
paper are mine, of course, not those of 
the task force. 

Analyzing the software problem into 
the Aristotelian categories of essence 
and accident was inspired by Nancy 
Greenwood Brooks, who used such 
analysis in a paper on Suzuki violin 
pedagogy. 

Frederick P. Brooks, Jr. 

University of North Carolina 


Comparison questioned 

To the editor: 

The article in your March issue by Kirk Jordan of Exxon Research entitled “Performance Comparison of Large-Scale Scientific Computers” may have left your readers with some erroneous impressions of the IBM 3090 Vector Facility.

The author ran his codes at the IBM Washington Systems Center before the general availability of our vector product and associated software. He used a pre-release version of our VS Fortran compiler, which was not intended to be shipped as a generally available product. The data published should not be considered as representative of the performance of current IBM products.

Troy L. Wilson 

IBM Data Systems Division 

Author’s reply: 

In response to the letter to the editor from Troy L. Wilson of IBM Data Systems Division, I appreciate his thoughtful and carefully worded comments.

The benchmark work reported in this article, like any benchmark work, represents a “snap-shot” comparison. Since then, I believe all the vendors have improved the performance of their hardware or software or both.

With regard to his comment on the VS Fortran compiler, I used a pre-release version of VS Fortran Level 2.1.0, recommended by IBM. To try to avoid giving any erroneous impressions, I included the version and release numbers of all the vendors’ compilers and operating systems and noted that the IBM Compiler was a pre-release version.

Kirk E. Jordan 

Exxon Research and Engineering 


Wrong algorithm used? 

To the editor: 

I read with great interest the article “Multiprocessing the Sieve of Eratosthenes” in Computer (April 1987). My research at Columbia University includes the problems of primality detection and factorization by use of massively parallel machines (i.e., many thousands of processors)....

The Sieve of Eratosthenes may be the best algorithm for finding all primes between 1 and N, for N of reasonable size. However, this depends on the model of computation. The implementation of an algorithm under the model depends on the issues you describe, such as load balance, trade-offs between signal and process time, and programming techniques.
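
For readers unfamiliar with the procedure under discussion, the following is a minimal serial sketch of the Sieve of Eratosthenes. It is written in Python purely for illustration and is not the parallel program evaluated in the April article; the function name `primes_up_to` is an assumption introduced here.

```python
def primes_up_to(n):
    """Return all primes p with 2 <= p <= n using the Sieve of Eratosthenes."""
    if n < 2:
        return []
    # is_prime[i] == 1 means i has not yet been struck out as composite.
    is_prime = bytearray([1]) * (n + 1)
    is_prime[0] = is_prime[1] = 0
    p = 2
    while p * p <= n:
        if is_prime[p]:
            # Strike out multiples of p, starting at p*p; smaller
            # multiples were already removed by smaller primes.
            count = len(range(p * p, n + 1, p))
            is_prime[p * p :: p] = bytearray(count)
        p += 1
    return [i for i, flag in enumerate(is_prime) if flag]


print(primes_up_to(30))
```

The inner striking-out step for each prime is independent of the others once that prime is known, which is the kind of work a parallel version could plausibly distribute across processors.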

However, it is crucial to consider 
other elements in the problem. I wonder 
why your article ... does not consider 
these. For example, other algorithms 
would be appropriate if there were a 
large number of processors. 

In the evaluation of parallel architectures, it is crucial to use the best possible algorithm to solve the problem. The April 1987 article makes the error of using the wrong algorithm to evaluate the architecture. The Sieve of Eratosthenes is not the “only procedure for finding prime numbers,” nor is it the best for finding large prime numbers. Indeed, the available parallelism of “p = 6” (p. 57) is certainly not inherent to the problem.

A speedup linear in the number of processors is achievable in the primality detection problem. This can be done by use of randomized algorithms and massively parallel machines. Several such machines were described in the January 1987 Computer. This linear speedup has been demonstrated as part of my research at Columbia University on the DADO parallel computer. This allows detection of 50-digit prime numbers in a matter of seconds. Similar work has been reported for the MPP computer.
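
The letter does not name the randomized algorithm used. As an illustration only, the sketch below uses the Miller-Rabin test, a well-known randomized primality test; choosing it is an assumption made here, and it is not necessarily the method run on DADO. The function name `is_probable_prime` and its parameters are likewise hypothetical.

```python
import random


def is_probable_prime(n, rounds=20, rng=None):
    """Miller-Rabin test (assumed here for illustration).

    Returns False if n is certainly composite, True if n is prime
    with high probability.
    """
    rng = rng or random.Random()
    if n < 2:
        return False
    # Quick trial division by a few small primes.
    for p in (2, 3, 5, 7, 11, 13, 17, 19, 23, 29):
        if n == p:
            return True
        if n % p == 0:
            return False
    # Write n - 1 as d * 2**s with d odd.
    d, s = n - 1, 0
    while d % 2 == 0:
        d //= 2
        s += 1
    for _ in range(rounds):
        a = rng.randrange(2, n - 1)  # random base for this round
        x = pow(a, d, n)
        if x in (1, n - 1):
            continue  # this base does not prove n composite
        for _ in range(s - 1):
            x = pow(x, 2, n)
            if x == n - 1:
                break
        else:
            return False  # base a witnesses that n is composite
    return True
```

Each round uses an independently chosen base, so in principle the rounds could be run on separate processors. That independence is one way a near-linear speedup of the kind described might arise, though the letter gives no details.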

In addition, the “lower bound” of 
Figure 11 pertains only to memory 
references. This is distinct from the 
problem complexity, which is almost 
polynomial. By use of randomized 
algorithms a nearly linear speedup can 
be obtained on massively parallel 
machines.... 

Mark Lerner 

Columbia University 


Author’s reply: 

As far as I am aware, the Sieve of Eratosthenes and its variants are the only algorithms for finding all prime numbers between 1 and some given N. Lerner confuses this problem with the problem of determining if a given number is prime. In any case, the objective of my article was to use a parallel version of the Sieve as “a test of some of the capabilities of a parallel machine” (p. 50, 1 3) and “not to generate prime numbers efficiently” (p. 51, line 19).

I disagree with Lerner that it is “crucial to use the best possible algorithm to solve the problem” in the evaluation of parallel architectures. A specific experiment may measure only some aspects of an architecture; the particular algorithm used is irrelevant. The proximity of the observed run time curve to the lower bound in Figure 11 (p. 58) is a measure of a machine’s efficiency. When the experiment is run on another machine, the curves may be closer together or further apart; this has nothing to do with the optimality of the algorithm.

I have been unable to find any 
research reports or publications related 
to prime number detection on DADO 
or MPP. Perhaps Lerner will describe 
his research in a forthcoming issue of 
Computer. 

Shahid H. Bokhari 
University of Engineering 
& Technology 
Lahore, Pakistan 


No sixth reference 

To the editor: 

Although I dislike being perceived as a complainer, it seems to me that reference number 6 (referred to in para. 2 of col. 2 on p. 99 in Computer, May 1987) is missing from the list of references (bottom of col. 3, same page, same issue). If this is the case, would you please consider having the missing reference published as an addendum in a future issue?

Thom Grace 

Illinois Institute of Technology 

The references are complete. The sixth 
reference was deleted by the author 
after the article had been typeset. 
Unfortunately, the editors overlooked 
the corresponding deletion of its in-text 
citation. 

Ed. 
Systolic arrays have
regular and modular 
structures that match 
the computational 
requirements of many 
algorithms. Their 
implementation 
requires that a wealth 
of subsumed concepts 
and engineering 
solutions be mastered 
and understood. 


Systolic arrays are the result of
advances in semiconductor tech¬ 
nology and of applications that 
require extensive throughput. Their reali¬ 
zation requires human ingenuity combined 
with techniques and tools for algorithm 
development, architecture design, and 
hardware implementation. 

Invariably, the first reaction of people 
who are exposed to the systolic-array con¬ 
cept is one of admiration for the concept’s 
elegance and for its potential for high per¬ 
formance. However, those who next 
attempt to implement a systolic array for 
a specific application soon realize that a 
wealth of subsumed concepts and engi¬ 
neering solutions must be mastered and 
understood. This special issue attempts to 
provide insights into the implementation 
process and to illustrate the different tech¬ 
niques and theories that contribute to the 
design of systolic arrays. 


Characteristics of 
systolic arrays 

Since 1978, when H.T. Kung and C.E. 
Leiserson 1 introduced the term “systolic
array” and the concept behind the term,
much research has been done and much 
has been written about the design of 
algorithms and architectures suitable for 
such structures. Today, the idea of a sys¬ 
tolic array is as familiar to many computer 
scientists and engineers as that of a com¬ 
piler or a microprocessor. 

The term “array” originates in the
systolic array’s resemblance to a grid in which
each point corresponds to a processor and 
a line corresponds to a link between 
processors. As regards this structure, sys¬ 
tolic arrays are descendants of array-like 
architectures such as iterative arrays, 2 cel¬ 
lular automata, 3 and processor arrays. 4 
These architectures capitalize on regular 
and modular structures that match the 
computational requirements of many 
algorithms. Table 1 is a list of applications 
for which systolic designs are available. 
Systolic arrays belong to the generation of 
VLSI/WSI (Very Large Scale Integra¬ 
tion/Wafer Scale Integration) architec¬ 
tures for which regularity and modularity 
are important to area-efficient layouts. 

Although the array structure character¬ 
izes the interconnections in systolic arrays, 
it is the term “systolic” that captures the 
innovative and distinctive behavior of 
these systems. “Systolic” in this context
means that pipelined computations take 
place along all dimensions of the array and 
result in very high computational through¬ 
put. In other words, systolic algorithms 
schedule computations in such a way that 
a data item is not only used when it is input 
but also is reused as it moves through the 
pipelines in the array. This results in 
balancing the processing and input/output 
bandwidths, especially in compute-bound
problems that have more computations to 
be performed than they have inputs and 
outputs. Conventional processor designs 
are often limited by the mismatch of input 
bandwidth and output bandwidth, which 
occurs because data items are read/writ¬ 
ten every time they are referenced. 
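
To make the notion of a compute-bound problem concrete, the following Python sketch counts arithmetic operations and input/output words for the product of two n-by-n matrices. The function names and the counting convention (one multiplication and one addition per inner-product step, each matrix word transferred once when data are reused) are illustrative assumptions, not figures taken from this article.

```python
def arithmetic_ops(n: int) -> int:
    """Operations in an n x n matrix product: n^3 multiplies plus n^3 adds."""
    return 2 * n ** 3


def io_words_with_reuse(n: int) -> int:
    """Words moved if A and B are each read once and C is written once."""
    return 3 * n ** 2


def memory_refs_without_reuse(n: int) -> int:
    """Words moved if both operands are fetched for every multiply
    and each element of C is written once."""
    return 2 * n ** 3 + n ** 2


for n in (4, 64, 1024):
    ratio = arithmetic_ops(n) / io_words_with_reuse(n)
    print(f"n={n}: {ratio:.1f} operations per I/O word with full reuse")
```

Under these assumptions the ratio of operations to I/O words grows as 2n/3, which is why reusing each input inside the array, rather than fetching it again for every reference, relieves the input/output bandwidth.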

One reason for choosing “systolic” as 
part of the term “systolic array” was to 
draw an analogy with the human circula¬ 
tory system, in which the heart sends and 
receives a large amount of blood as a result 
of the frequent and rhythmic pumping of 
small amounts of that fluid through the 
arteries and veins. In this analogy, the 
heart corresponds to a source and destina¬ 
tion of data, such as a global memory, and 
the network of veins is equivalent to the 
array of processors and links. Another 
explanation of the term is that in many of 
the first proposed systolic architectures, 
processing elements alternated between 
cycles of “admission” and “expulsion” of 
data—much in the same way that the heart 
behaves with respect to the pumping of 
blood. 

In the article “Why Systolic Architec¬ 
tures?” 5 H.T. Kung presents an excellent 
introduction to the basic ideas, the advan¬ 
tages, and the open problems of systolic 
arrays. Today, this article is still essential 
reading for those interested in learning the 
fundamentals of systolic arrays. Our intro¬ 
duction endeavors neither to replace nor 
to repeat the contents of that pioneering 
article. However, it is appropriate to 
elaborate briefly on the three factors that 
characterize systolic arrays as they were 
originally proposed, namely technology, 
parallel/pipelined processing, and appli¬ 
cations. These factors also identify the rea¬ 
sons for the success of the concept, namely 
cost-effectiveness, high performance, and 
the abundance of applications for which 
systolic arrays can be used. 

Technology and cost-effectiveness. 

Nowadays, mature VLSI/WSI technology 
permits the manufacture of circuits whose 
layouts have minimum feature sizes of 1 to
3 microns. The effective yields of
VLSI/WSI fabrication processes make 
possible the implementation of circuits 
with up to half a million transistors at 
reasonable cost—even for relatively small 
production quantities. However, the 
advantages of this technology are not fully 
realized unless simple, regular, and modu¬ 
lar layouts are used. Systolic arrays 
attempt to meet these topological con¬ 
straints by using simple processing ele¬ 
ments that, together with a simple 
interconnection pattern, are replicated 
along one or more dimensions. Cost, 
regularity, and modularity are factors 
leading to the design and optimization of 
individual processing elements and their 
respective interconnections. Considera¬ 
tion of these three factors indicates that 
processor arrays are cost-effective engi¬ 
neering solutions to the problem of build¬ 
ing systems with many processing 
elements. 

The main difference between the design 
of systolic arrays and that of other inte¬ 
grated systems of comparable complexity 
is illustrated in a general way in Figure 1. 
The Y-chart shown in the figure is a con¬ 
venient and succinct description of the 
different phases of the process of design¬ 
ing VLSI systems. 6,7 The axes of the Y- 
chart correspond to orthogonal forms of 
system representation, and the arrows rep¬ 
resent design procedures that translate one 
representation into another. A top-down 
design procedure (that is, one that 
progresses from more complex compo¬ 
nents to simpler subcomponents) can also 
be indicated—by arrows drawn along each 
axis and pointed toward the origin. While 
many different design approaches (and
their corresponding Y-charts) are possible,
design is typically carried out through
successive refinements. In this process, a 
component’s functional specification is 
translated first into a structural represen¬ 
tation and then into a geometrical descrip¬ 
tion in terms of smaller subcomponents; 
the functional description of each of these 
subcomponents must then be translated 
into structural and geometrical descrip¬ 
tions in terms of even smaller parts, and so 
on. The line arrows shown in the figure are 
intended to convey, in a general way, the 
flow of this process for systolic arrays 
versus more conventional systems. Since 
a systolic array consists of a large number 
of a few types of modules, the process of 
refining the overall system and designing 
every subcomponent is faster and simpler 
than it is in systems with the same size but 
a much larger number of module types. 


Table 1. Applications for which systolic 
designs are available. 


Signal and Image Processing and
Pattern Recognition
FIR, IIR filtering, and 1D convolution
2D convolution and correlation
Discrete Fourier Transform
Interpolation
1D and 2D median filtering
Geometric warping
Feature extraction
Order statistics
Minimum-distance classification
Covariance matrix computation
Template matching
Seismic signal classification
Cluster analysis
Syntactic pattern recognition
Radar signal processing
Curve detection
Dynamic scene analysis
Image resampling
Scene matching

Matrix Arithmetic
Matrix-matrix multiplication
Matrix triangularization
QR decomposition
Sparse-matrix operations
Solution of triangular linear systems

Non-Numeric Applications
Data structures: stacks and queues, sorting
Graph algorithms: transitive closure, minimum spanning trees
Connected components
Language recognition
Dynamic programming
Arithmetic arrays
Relational database operations
Algebra


This is conveyed graphically in Figure 1 by 
means of large arrows showing that in the 
design of a systolic array, one can proceed 
faster and more directly to the design of 
lower-level components of the system than 
in traditional design. 

Figure 1. A Y-chart that shows the process of designing algorithmically specified
VLSI digital systems.

Commercially available systolic-array
chips with 10 to 100 simple, 1-bit processors
exist; these chips sell for less than one
hundred dollars apiece. Other chips, 
including microprocessors and digital¬ 
processing chips, both of which can be 
used as building blocks in systolic arrays, 
are also available—at even lower cost. Sys¬ 
tolic arrays with thousands of processors 
can be built by assembling many such 
building blocks (chips) at total prices that 
range from ten thousand to a hundred 
thousand dollars and depend on the com¬ 
plexity of each processor. 


Parallel/pipelined processing. Systolic 
arrays derive their computational effi¬ 
ciency from multiprocessing and pipelin¬ 
ing. Multiprocessing is a natural 
consequence of the activities going on 
simultaneously in various processing ele¬ 
ments of the array. Pipelining can be 
thought of as a form of multiprocessing 
that optimizes resource utilization and 
takes advantage of dependencies among 
computations. In systolic arrays, data 
pipelining reduces the input/output- 
bandwidth requirements by allowing a 
data item to be reused once it enters the 


array. Typically, inputs enter the array 
through peripheral processing elements 
and are propagated to neighboring 
processing elements for further process¬ 
ing. These movements of data through the 
array take place both along a fixed direc¬ 
tion in which a link exists between neigh¬ 
boring processing elements and in a 
periodic manner. 

In addition to data pipelining, systolic 
arrays are also characterized by computa¬ 
tional pipelining, in which information 
flows from one processing element to 
another in a prespecified order. This infor¬ 
mation can be interpreted by the receiver 
as data, control, or a combination of both. 
Each output is computed by the 
execution—at different times and in a 
predetermined sequence—of several oper¬ 
ations in a number of processing elements; 
the execution is performed in such a way 
that the output generated by one process¬ 
ing element is used as an input by a neigh¬ 
boring processing element. While 
operations can occur as data flows through 
each processor, the overall computation is 
not a dataflow computation, since the 
operations are executed according to a 


schedule determined by the systolic-array 
design. After a processing element gener¬ 
ates an intermediate output and sends this 
output to the element’s neighboring 
processing elements, the element computes 
another intermediate output. As a result, 
processing resources are utilized effi¬ 
ciently. In the general case, each process¬ 
ing element can be constructed as a 
pipelined processor. Such construction 
results in the so-called two-level pipelined 
systolic array and in even higher 
throughputs. 
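
The interplay of data pipelining and computational pipelining can be illustrated with a small software simulation. The sketch below models one common textbook arrangement, assumed here for illustration and not drawn from any design in this issue: a square array in which processing element (i, j) accumulates one element of a matrix product while operands of A move rightward and operands of B move downward, one neighbor per cycle. The function name and the skewed input schedule are hypothetical choices.

```python
def systolic_matmul(A, B):
    """Simulate an n x n array of multiply-accumulate processing elements.

    Row i of A enters the left edge delayed by i cycles; column j of B
    enters the top edge delayed by j cycles. Every value is reused by
    each processing element it passes through.
    """
    n = len(A)
    acc = [[0] * n for _ in range(n)]    # running sum held in each element
    a_reg = [[0] * n for _ in range(n)]  # operand currently moving right
    b_reg = [[0] * n for _ in range(n)]  # operand currently moving down

    for t in range(3 * n - 2):           # cycles until the last product is formed
        new_a = [[0] * n for _ in range(n)]
        new_b = [[0] * n for _ in range(n)]
        for i in range(n):
            for j in range(n):
                if j == 0:               # left edge: take input from outside
                    k = t - i
                    new_a[i][j] = A[i][k] if 0 <= k < n else 0
                else:                    # interior: take from left neighbor
                    new_a[i][j] = a_reg[i][j - 1]
                if i == 0:               # top edge: take input from outside
                    k = t - j
                    new_b[i][j] = B[k][j] if 0 <= k < n else 0
                else:                    # interior: take from upper neighbor
                    new_b[i][j] = b_reg[i - 1][j]
        a_reg, b_reg = new_a, new_b
        for i in range(n):
            for j in range(n):
                acc[i][j] += a_reg[i][j] * b_reg[i][j]
    return acc


print(systolic_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]]))
```

Each operand enters the array once at the boundary and is then used by n processing elements, and the order of operations is fixed by the schedule rather than by the arrival of data, which matches the distinction from dataflow computation drawn above.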

Applications and algorithms. Algo¬ 
rithms suitable for implementation in sys¬ 
tolic arrays can be found in many applica¬ 
tions, such as digital signal and image 
processing, linear algebra, pattern recog¬ 
nition, linear and dynamic programming, 
and graph problems. In fact, most of the 
algorithms in the listed applications are 
computationally intensive and require sys¬ 
tolic architectures for their implementa¬ 
tions when used in real-time environments. 
The acceptance of this fact is evidenced by 
the existence of prototype and production 
systolic arrays for modern real-time digi¬ 
tal signal processing systems. The 
manufacturers of these arrays include, 
among others, companies such as ESL- 
TRW, Hughes, NCR, GE, Hazeltine, and 
Motorola. When systolic arrays were first 
proposed, they were intended for applica¬ 
tions with two important sets of charac¬ 
teristics. First, these applications require 
high throughput and large processing 
bandwidth, possibly at the cost of 
increased response time. In other words, 
it is more important to keep up with the 
flow of data than to generate a set of out¬ 
puts for a given set of inputs as quickly as 
possible. Second, these applications can be 
efficiently supported by algorithms that 
can be implemented on arrays consisting 
of a few types of simple processing ele¬ 
ments; the arrays have simple controls and 
input/output ports in the peripheral 
processing elements. These algorithms are 
characterized by repeated computations of 
a few types of relatively simple operations 
that are common to many input data 
items. Often the algorithms can be 
described by programs with nested loops 
or by recurrence equations that describe 
computations performed on indexed data. 
In addition, the pattern of generation and 
usage of data by different operations dis¬ 
plays some regularity and uniformity, 
which means that the resulting communi¬ 
cation requirements can be met by the 
localized interconnections. 
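
As an illustration of the two descriptions mentioned above, the following Python sketch writes a finite impulse response (FIR) filter first as a program with nested loops and then as a recurrence over partial sums. The function names, the choice of FIR filtering as the example, and the reading of each recurrence stage as one processing element of a linear array are assumptions made for this sketch.

```python
def fir_nested_loops(w, x):
    """y[i] = sum over k of w[k] * x[i - k], for every i with a full window."""
    K = len(w)
    y = []
    for i in range(K - 1, len(x)):
        total = 0
        for k in range(K):
            total += w[k] * x[i - k]
        y.append(total)
    return y


def fir_recurrence(w, x):
    """Same filter as a recurrence: y(i, k+1) = y(i, k) + w[k] * x[i - k].

    Stage k uses only the fixed weight w[k] and the partial sum from
    stage k - 1, so it could be assigned to one processing element.
    """
    K = len(w)
    indices = range(K - 1, len(x))
    partial = {i: 0 for i in indices}    # y(i, 0) = 0
    for k in range(K):                   # one stage per weight
        for i in indices:
            partial[i] = partial[i] + w[k] * x[i - k]
    return [partial[i] for i in indices]


w = [1, 2, 3]
x = [1, 0, 0, 1, 2]
print(fir_nested_loops(w, x))
print(fir_recurrence(w, x))
```

Both forms apply the same simple operation, a multiplication followed by an addition, to indexed data, and each stage of the recurrence depends only on the stage before it, which is the kind of regular, local dependence that localized interconnections can serve.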

Implementation issues

Given the technical and economic prin¬ 
ciples that assure the soundness of the 
systolic-array concept, one needs to con¬ 
sider the issues involved in implementing 
a system for a specific application. Some 
of these issues are briefly discussed here. 

General-purpose and special-purpose 
systolic systems. Typically, a systolic array 
can be thought of as an algorithmically 
specialized system in the sense that its 
design reflects the requirements of a spe¬ 
cific algorithm. However, it may be desir¬ 
able to design systolic arrays that are 
capable of efficiently executing more than 
one algorithm for one application or more. 
Two approaches are possible in designing 
these “large-purpose” systems, and a 
compromise between the two is often 
found in many actual implementations. 
One approach is based on adding hard¬ 
ware mechanisms so as to reconfigure the 
topology and interconnection pattern of 
the systolic array and to emulate the 
requirements of a specialized design. A 
concrete example of this approach is the 
Configurable Highly Parallel computer 
(CHiP), 8 which has a programmable lat¬ 
tice of switches for reconfiguration pur¬ 
poses. The other approach uses software 
to map different algorithms into a fixed- 
array architecture. As is the case with the 
approach behind other general-purpose 
parallel computers, this approach may 
require the use of programming languages 
capable of expressing parallel computa¬ 
tions, as well as the development of trans¬ 
lators, operating systems, and pro¬ 
gramming aids. These requirements apply, 
for example, in the case of Warp, 9 a sys¬ 
tolic array developed at Carnegie Mellon 
University. For each algorithm, the 
designer needs to identify the efficient sys¬ 
tolic designs and mappings and the appro¬ 
priate techniques to use. The issue of 
appropriate techniques is of great impor¬ 
tance, since the final performance, cost, 
and correctness of the design are governed 
by these techniques. 

Design and mapping techniques. To
synthesize a systolic array from the
description of an algorithm, a designer 
needs a thorough understanding of and 
familiarity with the principles behind four 
things: systolic computing, the applica¬ 
tion, the algorithm, and the technology. 
Such skilled designers can provide excel¬ 
lent heuristic designs for important 


algorithms. However, the process is slow 
and error prone and may require extensive 
simulations, and the resulting designs are 
not guaranteed to be optimal or correct. 
Progress has been made in the develop¬ 
ment of systematic design techniques to 
automate this process. 10 These techniques 
are unlikely to replace the designers com¬ 
pletely; instead, they will provide tools and 
formal concepts to assist designers in 
searching for diverse and desirable designs 
for a given application. Most of these tech¬ 
niques are concerned with the derivation 
of a relatively high-level specification of 
the array architecture from a description 
of the algorithm. Typically, such a speci¬ 
fication includes the size and topology of 
the array, the operations performed by 
each processing element, the order and
timing of data communication, and inputs
and outputs. To a limited extent, these 
techniques can take into account techno¬ 
logical factors and the relationship of the 
systolic array itself to the rest of the sys¬ 
tem. However, they are not complete; they 
can only be used at the specification 
level—and only in an indirect manner 
there. Until more is learned about design 
techniques that can be used conveniently 
for detailed integration of system and tech¬ 
nology, such integration problems will 
continue to be left for the designer to solve. 

Granularity. The basic operation per¬ 
formed in each cycle by each processing 
element in the various systolic arrays can 
range from a simple bit-wise operation, to 
word-level multiplication and addition, 
and even to execution of a complete pro¬ 
gram. The choice of granularity is deter¬ 
mined by the application, or the 
technology, or both. For example, appli¬ 
cations that use algorithms with basic bit- 
level operators and data structures natu¬ 
rally suggest that processing elements be of 
a corresponding complexity. The same 
choice of processing elements might, how¬ 
ever, result from considerations such as 
input/output-pin restrictions and the tech¬ 
nology that may be used. In programma¬ 
ble systolic arrays, the granularity may 
also be determined by trade-offs between
the desired degree and level of programmability.
The Saxpy Matrix-1 11 is an
example of a programmable systolic com¬ 
puter with large granularity, whereas bit- 
level systolic arrays, like those discussed by 
J.V. McCanny and J.G. McWhirter, 6 are 
special-purpose designs with low 
granularity. 

Extensibility. Many specialized systolic 
arrays can be regarded as hardware 
implementations of a given algorithm. 
This view holds when there is a direct cor¬ 
respondence between the operations and 
variables of the algorithm and, respec¬ 
tively, the processing elements and wire 
links of the systolic array. In such a case, 
the systolic processor can execute only a 
given algorithm that is designed for a prob¬ 
lem of a specific size. If one wishes to exe¬ 
cute the same algorithm for a problem of 
a larger size, then either a larger array must 
be built or the problem must be parti¬ 
tioned. The first approach is easy to con¬ 
ceptualize and simply requires that more 
processing elements be used to construct 
an enlarged version of the original array. 
However, as regards implementation, one 
must remember that there may be factors 
that do not affect performance in small 
arrays but might affect it in larger systems. 
These factors include clock synchroniza¬ 
tion, reliability, power requirements, chip- 
size limitations, and input/output-pin 
constraints. 

Clock synchronization. In large syn¬ 
chronous systolic arrays, clock lines of 
different lengths can introduce clock 
skews and may require that a slower clock 
be used. Possible approaches that avoid 
this problem of clock skews include 
designing systolic arrays that do not allow 
data to flow in opposite directions and 
using efficient layouts of the clock distri¬ 
bution network. 12 An alternative to the 
design of a globally synchronous array is 
to achieve a self-timed system through the 
use of asynchronous handshaking 
mechanisms established between neigh¬ 
boring processing elements. These self- 
timed implementations are commonly 
referred to as wavefront arrays. 13

Reliability. Simple laws of probability 
can be used to explain why increasingly 
large arrays are decreasingly reliable unless 
redundancy is incorporated and fault- 
tolerance mechanisms are available. In 
fact, the reliability of an array of processors
is equal to that of a single processor raised
to the power of the number of processors in
the array, provided that processors fail
independently and that every processor must
work. Since the reliability of a processor
is a value less than one, the reliability
of the global array quickly approaches 
zero as the number of processors increases. 
Fault tolerance requires that faults be 
detected and located so that faulty process¬ 
ing elements can be replaced by opera¬ 
tional spares through an appropriate 
reconfiguration scheme. A fault-tolerant 
systolic array may need additional hard¬ 
ware to meet these requirements. In addi¬ 
tion, if time redundancy is used or system 
operation needs to be suspended for test¬ 
ing purposes, the fault-tolerant array can 
be slower than the original one. A good 
fault-tolerant design has as its goal max¬ 
imizing reliability while minimizing the 
corresponding overhead. In systolic 
arrays, possible approaches to fault toler¬ 
ance include simple extensions of well- 
known techniques used in conventional 
digital systems. However, these techniques 
do not take advantage of the characteris¬ 
tics of either systolic arrays or the 
algorithms they execute. Novel and suc¬ 
cessful, though general, fault-tolerance 
schemes 14 that take advantage of these 
characteristics have been proposed for sys¬ 
tolic arrays. 
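
The reliability argument at the start of this discussion can be checked numerically. The sketch below assumes independent failures and no spare processors, so that the array works only if every processor works; the per-processor reliability of 0.999 is an arbitrary illustrative value, and the function name is hypothetical.

```python
def array_reliability(processor_reliability: float, num_processors: int) -> float:
    """Reliability of an array with no redundancy: r raised to the power N."""
    if not 0.0 <= processor_reliability <= 1.0:
        raise ValueError("processor_reliability must lie between 0 and 1")
    if num_processors < 0:
        raise ValueError("num_processors must be non-negative")
    return processor_reliability ** num_processors


# Assumed per-processor reliability, chosen only for illustration.
for n in (10, 100, 1000, 10000):
    print(f"{n:>6} processors: {array_reliability(0.999, n):.6f}")
```

Even with processors that are individually very reliable, the product falls off quickly as the array grows, which is the motivation for the redundancy and reconfiguration schemes described above.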

Partitioning of large problems. When it 
is necessary to execute a large problem 
without building a large systolic array, the 
problem must be partitioned so that the 
same algorithm can be used to solve the 
smaller problem and so that an array of 
small, fixed size can be used. The main 
concerns are to avoid rendering the parti¬ 
tioned algorithm incorrect and to avoid 
increasing the complexity of the design sig¬ 
nificantly. One approach identifies algo¬ 
rithm partitions and an order of execution 
of these partitions such that correctness is 
preserved and the original array can be 
used to execute each partition. 15 The per¬ 
ceived result of this approach is that the 
array “travels” through the set of compu¬ 
tations of the algorithm in the right order 
until it “covers” all the computations. 
Another approach attempts to restate the 
problem to be solved so that the problem 
becomes a collection of smaller problems 
that is similar to the original one and that 
can be solved by the given systolic array. 16 
While this second approach has less gener¬ 
ality and is harder to automate than the 
first approach, it may have better perform¬ 
ance when it is applicable. 
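
A minimal sketch of partitioning, closest in spirit to the second approach, is the restatement of a large matrix product as a collection of small block products, each of which fits an array of fixed size p by p. In the Python code below, small_array_multiply is a hypothetical stand-in for the fixed-size systolic array, and the assumption that the matrix dimension is a multiple of p is made only to keep the example short.

```python
def small_array_multiply(a, b):
    """Stand-in for a fixed p x p array: multiplies two p x p blocks."""
    p = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(p)) for j in range(p)]
            for i in range(p)]


def block(M, row, col, p):
    """Extract the p x p block of M whose top-left corner is (row, col)."""
    return [r[col:col + p] for r in M[row:row + p]]


def partitioned_matmul(A, B, p):
    """Multiply n x n matrices using only p x p block products."""
    n = len(A)
    if n % p != 0:
        raise ValueError("this sketch assumes n is a multiple of p")
    C = [[0] * n for _ in range(n)]
    for bi in range(0, n, p):
        for bj in range(0, n, p):
            for bk in range(0, n, p):
                partial = small_array_multiply(block(A, bi, bk, p),
                                               block(B, bk, bj, p))
                for i in range(p):          # accumulate outside the array
                    for j in range(p):
                        C[bi + i][bj + j] += partial[i][j]
    return C


A = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
print(partitioned_matmul(A, A, 2))
```

In this sketch each block of A and B is supplied to the small array n/p times, so the saving in hardware is paid for with repeated access to the input data.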

Automated design tools. The processing 
elements and module libraries play an 
important role in making the process of 


designing special-purpose arrays of 
processing elements faster and more cost- 
effective. In addition to the many existing 
tools for designing VLSI and WSI systems 
that can be readily used in this process, the 
regularity and algorithmic nature of sys¬ 
tolic arrays permits the use of high-level 
silicon compilers. 7 At this time, the devel¬ 
opment process is not fully automated; the 
process will depend on future progress in 
design automation and computer-aided 
design tools. 

Universal building blocks. Systolic 
arrays cost less to implement than other 
arrays because of their extensive replica¬ 
tion of a small number of simple, basic 
modules and because of their highly dense 
and efficient layouts. It is worthwhile for
the simple building blocks to be carefully
designed and optimized, since the costs 
involved are amortized over a large num¬ 
ber of replicated circuits. The modular 
design of systolic arrays allows designers 
who want rapid prototyping of their ideas 
to use off-the-shelf devices, such as 
microprocessors, floating-point arithmetic 
units, and memory chips. However, these 
parts may not be designed for implement¬ 
ing systolic arrays and may therefore be 
inadequate to meet the design require¬ 
ments. This has led to the development of 
“universal building blocks”—chips that 
can be used for many systolic arrays. The 
cost of such development is, therefore, 
amortized over replicated modules in 
many arrays rather than concentrated in 
simply one array. Commercially available 
chips that are worthy of consideration as 
basic modules include the INMOS Transputer,
the TI TMS32010 and TMS32020,
the NEC dataflow chip µPD7281, Analog
Devices’ ADSP2100, the Fujitsu MB8764, 
and the National LM32900. Problems 
involved in the use of programmable 
building blocks include developing pro¬ 
gramming tools to aid designers and 
providing support for flexible intercon¬ 
nections. 

Integration into existing systems.
Although systolic arrays provide extensive
throughput, their integration into existing
systems may be nontrivial because of the 
extensive input/output bandwidth 
involved, especially when a problem has to 
be partitioned and input data have to be 
accessed repeatedly. Additional problems 
that have to be solved for systems with a 
large number of systolic arrays include the 
interconnections with the host, the mem¬ 
ory subsystem to support the systolic 
arrays, the buffering and access of data to 
meet the special input/output data distri¬ 
butions, and the multiplexing and demul¬ 
tiplexing of data when there are 
insufficient input/output ports. The prob¬ 
lems that must be faced are exemplified by 
Mosaic, 17 a project being carried out at 
ESL. The system consists of a statically 
scheduled crossbar switch that connects 
multiple Warp processors, each with local 
memory modules, into a macropipeline. 
The local memory modules are used to 
store input data and restructure them into 
the required input format. 


The future 

By the year 2000, it will be possible to 
build integrated circuits with one billion 
transistors—more than one thousand 
times the number of devices available in 
today’s densest integrated circuits. 18 
These incredibly large circuits will use 
0.1-micron geometries made possible by
advanced optical, electron-beam, ion- 
beam, or X-ray lithography. While the 
high cost of setting up integrated-circuit 
factories that can handle these technolo¬ 
gies will certainly impact the initial cost per 
chip, the main manufacturing limitations 
will be in the design, verification, testing, 
and packaging of such large circuits. In 
addition, the percentage of the chip area 
dedicated to interconnections could 
increase to more than 80 percent. Systolic 
arrays will take advantage of submicron 
technologies without suffering from the 
problems just mentioned, since they are 
modular, have regular interconnections, 
and are extensible. By the year 2000, 
mature design and programming tools and 
extensive knowledge of suitable applica¬ 
tions and algorithms will probably render 
systolic arrays the architecture of choice 
for submicron circuits designed for digital 
signal processing, fast arithmetic, sym¬ 
bolic processing, and intelligent databases. 

Systolic arrays have triggered extensive 
related work and research in the areas of 
processor-array architecture, algorithm 
design and analysis, and parallel programming.
These areas are often identified as
systolic architecture, systolic algorithms, 
and systolic computing, respectively. As a 
consequence, the principles behind systolic 
arrays have gained an enlarged scope. That 
is, systolic architectures are not necessar¬ 
ily arrays of processors; systolic 
algorithms may be very complex and may 
not necessarily be executed in simple 
processing elements; and systolic comput¬ 
ing can take place in computers other than 
systolic architectures. The prominent fea¬ 
tures of systolic arrays are the processing 
elements, which implement processes, and 
the regular interconnection of multiple 
processing elements. The processing ele¬ 
ments and the interconnection of process¬ 
ing elements can be implemented in 
software, general-purpose microproces¬ 
sors, or specialized hardware. Because of 
this variety of implementation possibili¬ 
ties, systolic arrays have, since the late 
seventies, evolved to become cellular com¬ 
puting at the algorithmic, programming, 
architectural, and hardware levels. We 
are, therefore, witnessing a trend in which 
systolic computing is becoming a pervasive 
form of multiprocessing. □ 

Wavefront Array
Processors—Concept to 
Implementation 

S.Y. Kung, S.C. Lo, S.N. Jean, and J.N. Hwang 
University of Southern California 


Most signal and image 
processing algorithms 
can be decomposed 
into computational 
wavefronts that can be 
processed on pipelined 
arrays. 


The supervisory overhead incurred
in general-purpose supercom¬ 
puters often makes them too slow 
and expensive for real-time signal and 
image processing. To achieve a through¬ 
put rate adequate for these applications, 
the only feasible alternative appears to be 
massively concurrent processing by 
special-purpose hardware—by array 
processors. Progress in VLSI technology 
has lowered implementation costs for large 
array processors to an acceptable level, 
and CAD techniques have facilitated 
speedy prototyping and implementation of 
application-oriented (or algorithm- 
oriented) array processors. 

Digital signal and image processing 
encompasses a variety of mathematical 
and algorithmic techniques. Most signal 
and image processing algorithms are 
dominated by transform techniques, con¬ 
volution and correlation filtering, and cer¬ 
tain key linear algebraic methods. These 
algorithms possess properties such as 
regularity, recursiveness, and locality, and 
these properties can be exploited in array 
processor design. With VLSI it becomes 
feasible to construct an array processor 
that closely resembles the flow graph of a 
particular algorithm. This type of array 
maximizes the main strength of VLSI
(intensive computing power) and yet
circumvents its main weakness (restricted
communication).

0018-9162/87/0700-0018$01.00 © 1987 IEEE


Parallel processing architectures. SIMD 
(single-instruction, multiple-data-stream) 
computers, MIMD (multiple-instruction, 
multiple-data-stream) computers, systolic 
arrays, and wavefront arrays are popular 
multiprocessors. It is important to clarify 
their similarities and differences. 

SIMD arrays. An SIMD array is a syn¬ 
chronous array of processing elements 
(PEs) under the supervision of a single 
control unit. 1 All PEs receive the same 
instruction broadcast from the control
unit but operate on different data sets 
from distinct data streams. Broadcasting 
of data is usually allowed in an SIMD 
array (see Figure 1a).

MIMD arrays. An MIMD computer 
consists of a number of PEs, each with its 
own control unit, program, and data. 1 
The main feature of an MIMD machine is 
that the overall processing task can be dis¬ 
tributed among the PEs to increase pro¬ 
cessing parallelism. An MIMD machine 
may encounter communication bottle¬ 
necks when multiple PEs attempt to simul¬ 
taneously access shared system resources. 
Nevertheless, the flexibility of the MIMD 
architecture often makes it essential for 
dealing with irregularly structured 
algorithms. A dataflow machine, an 
MIMD computer in which an instruction 
is ready for execution as soon as its operands
arrive, offers a solution to the problem
of efficiently exploiting concurrency
of computation on a large scale, and it is 
compatible with modern concepts of pro¬ 
gram structure. 2 

Systolic arrays. Two popular special-
purpose VLSI array architectures are systolic
and wavefront arrays, both of which offer
massive concurrency derived from pipeline
processing or parallel processing or both.

A systolic array is a network of processors
that rhythmically compute and pass data
through the system. A systolic array is
often algorithm-oriented and is used as an
attached processor (i.e., with a host computer).
It features the important properties
of modularity, regularity, local interconnection,
a high degree of pipelining, and
highly synchronized multiprocessing. An
extensive literature on systolic array
processing exists; the reader is referred to
Fisher and Kung 3 and the references
therein.


Wavefront arrays. The data movements 
in systolic arrays are controlled by global 
timing-reference “beats.” The burden of
synchronizing an entire systolic computing 
network becomes heavy for very large 
arrays. A simple solution is to take advan¬ 
tage of the dataflow computing principle, 
which is natural to signal processing 
algorithms and which leads the designer to 
wavefront array processing. There are two 
approaches to deriving wavefront array 
algorithms: one is to trace and pipeline the 
computational wavefronts; the other is 
based on a data flow graph (DFG) model. 
Conceptually, the requirement for correct 
timing in the systolic array is now replaced 
by a requirement for correct sequencing in 
the wavefront array. 





Figure 1. A mesh-type SIMD array (a); a systolic/wavefront array (b). 


Comparisons. To highlight the charac¬ 
teristic differences among the architectures 
cited above, we propose a classification as 
shown in Figure 2. Note that a systolic 
array has local instruction codes and that 
external data are piped into the array con¬ 
currently with the processing. SIMD and 
wavefront arrays can be regarded as some¬ 
what more complex than systolic arrays. 
An SIMD array has control (instruction) 
buses and data buses (in lieu of the local 
instruction codes adopted in systolic 
arrays). A wavefront array, on the other 
hand, provides data-driven processing 
capability. MIMD multiprocessors gener¬ 
ally offer all the features just mentioned, 
possibly with an additional feature— 
shared memories. 

A mesh-type SIMD array is shown in
Figure 1a, while a systolic/wavefront
array is shown in Figure 1b. Note that an
SIMD array usually loads data into its 
local memories before the computation 
starts, while systolic and wavefront arrays 
usually pipe data from an outside host and 
also pipe the results back to the host. Dew 
and Manning 4 compare SIMD arrays and 
systolic arrays for a vision preprocessing 
application. They report that local win¬ 
dowing operations can be effectively 
implemented on both systolic and SIMD 
arrays. However, for data-dependent
operations such as a binary search corre¬ 
lator, the utilization of the SIMD array 


will be inferior to that of the systolic array. 
The efficiency of the systolic or wavefront 
array is due to the fact that the host han¬ 
dles image storage and can select the 
desired data and pipe them into the array. 

The wavefront array combines the sys¬ 
tolic pipelining principle with the dataflow 
computing concept. In fact, the wavefront 
array can be viewed as a static dataflow 
array that supports the direct hardware 
implementation of regular dataflow 
graphs. Exploitation of the dataflow prin¬ 
ciple makes the extraction of parallelism 
and programming for wavefront arrays
relatively simple.
Figure 2. Classification of SIMD
machines, MIMD machines, systolic
arrays, and wavefront arrays.

Figure 3. Wavefront processing for
matrix multiplication.

Wavefront array
processor

An approach to deriving wavefront 
arrays is to trace the computational 
wavefronts and pipeline these fronts on the 
processor array. “Computational 
wavefront” means smooth data move¬ 
ment in a localized communication net¬ 
work. The computing network serves as a 
data-wave-propagating medium. A 
wavefront in a processor array cor¬ 
responds to a mathematical recursion in an 
algorithm. Successive pipelining of 
wavefronts through the array will accom¬ 
plish the computation of all recursions. 

For example, let us examine how a
matrix multiplication algorithm can be
executed on a square, orthogonal $N \times N$
wavefront array (Figure 3). Let $A = \{a_{ij}\}$,
$B = \{b_{ij}\}$, and $C = A \times B = \{c_{ij}\}$,
and let all be $N \times N$ matrices.
The matrix $A$ can be decomposed into
columns $A_i$ and the matrix $B$ into rows $B_j$,
and therefore

$$C = A_1 * B_1 + A_2 * B_2 + \cdots + A_N * B_N \qquad (1)$$

where the product $A_i * B_i$ is the “outer
product.” The matrix multiplication can
then be carried out in $N$ sets of wavefronts
(recursions), each executing one outer
product:

$$C^{(k)} = C^{(k-1)} + A_k * B_k \qquad (2)$$

or, equivalently,

$$c_{ij}^{(k)} = c_{ij}^{(k-1)} + a_i^{(k)} \times b_j^{(k)} \qquad (3)$$

where $a_i^{(k)} = a_{ik}$ and $b_j^{(k)} = b_{kj}$, for
$k = 1, 2, \ldots, N$.

Let us now examine the computational 
wavefront for the first recursion in matrix 
multiplication. The elements of A are 
stored in memory modules to the left (in 
columns), and those of B in memory mod¬ 
ules on the top (in rows). The process starts 
with PE (1,1), where $c_{11}^{(1)} = c_{11}^{(0)} + a_{11} * b_{11}$
is computed. The computational activity
then propagates to the neighboring PEs,
(1,2) and (2,1), which execute
their respective operations. The next front
of activity will be at PEs (3,1), (2,2), and
(1,3). Thus, a computation wavefront that
travels down the processor array is 
created. Once the wavefront sweeps 
through all the cells, the first recursion is 
complete. As the first wave propagates, we 
can execute an identical second recursion 
concurrently by pipelining a second 
wavefront immediately after the first one. 


For example, the (1,1) processor will execute
$c_{11}^{(2)} = c_{11}^{(1)} + a_{12} * b_{21}$, and so on.
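
The pipelining of successive recursions can be illustrated with a small simulation. The sketch below is only an idealization: it assumes that every PE takes one time step per operation and that the fronts advance in lockstep, whereas a real wavefront array is self-timed. The function name and the 0-based indices are choices made for this illustration and do not come from the article.

```python
def wavefront_matmul_trace(A, B):
    """Idealized lockstep trace of wavefront matrix multiplication.

    PE (i, j) executes recursion k at step t = i + j + k (0-based),
    so recursion k is a diagonal front that trails recursion k - 1 by one step.
    """
    n = len(A)
    C = [[0] * n for _ in range(n)]
    schedule = {}  # step -> list of (i, j, k) operations executed in that step
    for t in range(3 * n - 2):
        for i in range(n):
            for j in range(n):
                k = t - i - j
                if 0 <= k < n:
                    # c_ij^(k) = c_ij^(k-1) + a_ik * b_kj
                    C[i][j] += A[i][k] * B[k][j]
                    schedule.setdefault(t, []).append((i, j, k))
    return C, schedule


if __name__ == "__main__":
    product, schedule = wavefront_matmul_trace([[1, 2], [3, 4]], [[5, 6], [7, 8]])
    print(product)
    for step in sorted(schedule):
        print(step, schedule[step])
```

In each step a given PE appears at most once in the trace, which mirrors the observation that the fronts of different recursions occupy different processors at any instant.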

The separate roles of pipelined and par¬ 
allel processing become evident when we 
carefully inspect how computational 
wavefronts that are to be processed in par¬ 
allel are pipelined successively through the 
processor array. The pipelining is feasible 
because the wavefronts of two successive 
recursions never intersect. That is, differ¬ 
ent processors are used to execute differ¬ 
ent recursions at any given instant. The 
computational wavefronts are similar to 
electromagnetic wavefronts, since each 
processor acts as a secondary source and 
is responsible for the activation of the next 
front. This means that the computation is 
data-driven. 

Note that the major difference between 
a wavefront array and a systolic array is 
the data-driven property. There is no 




global timing reference in a wavefront 
array, and yet the order of task sequenc¬ 
ing is correctly followed. In the wavefront 
architecture, the information transfer 
between a PE and its immediate neighbors 
is by mutual convenience. Whenever data 
are available, the transmitting PE informs 
the receiver, and the receiver accepts the 
data whenever required. It then commu¬ 
nicates with the sender to acknowledge 
that the data have been consumed. This 
scheme can be implemented by means of 
a simple handshaking protocol 5 which 
ensures that the computational wavefronts 
propagate in an orderly manner instead of 
crashing into one another. Since there is no 
need to synchronize the entire array, a 
wavefront array is truly architecturally 
scalable. 

On the other hand, a wavefront array 
and a systolic array are identical in terms 
of regularity, modularity, local intercon¬ 
nection, and pipelinability. They both con¬ 
sist of modular processing units with 
regular and (spatially) local interconnec¬ 
tions. Moreover, their computing net¬ 
works may be extended indefinitely. They 
exhibit a linear-rate speedup; that is, they
achieve an O(M) speedup in terms of
processing rates, where M is the number of
PEs.

In summary, a simple way to relate the 
wavefront array to its systolic counterpart
is the following:

Wavefront array = systolic array
+ dataflow computing.


Algorithm mapping 
and programming 

VLSI array processor technology is 
steadily advancing, and it is presently in 
transition from the research phase to the 
development phase. Therefore, we shall 
examine both the fundamental principles 
established by research and the implemen¬ 
tation issues critical to development. 

Mapping algorithms to systolic and 
wavefront arrays. As long as communica¬ 
tion in VLSI remains restricted, locally 
interconnected arrays will be of great 
importance. An increase of efficiency can 
be expected if the algorithm arranges for 
a balanced distribution of work load while 
observing the requirement for locality, 
that is, for short communication paths. 
Such a load distribution and information 
flow serves as a guideline to the designer 
of VLSI algorithms and eventually leads 
to new architectural and language designs. 

Given an algorithm, how can an array 
processor be systematically derived? A 
fundamental issue is how to express par¬ 
allel algorithms in a notation that is easy 
to understand but yet can be compiled into 
efficient VLSI array processors. The ulti¬ 
mate design should begin with a powerful 
algorithmic notation that expresses the 
recurrence and parallelism associated with 
the description of the space-time activities. 
This description should be convertible
into a VLSI hardware description
or into executable array processor machine
code.

A VLSI algorithm is often very regular 
and the computation activities are express¬ 
ible in terms of a simple grid model, as 
shown in Figure 4a. The computation is 
represented by nodes and the dependency 
of the computational nodes is represented 
by arcs. Such representation is termed the 
dependence graph, or DG, of the algo¬ 
rithm. A DG is a directed graph, which is 
embedded in an index space and specifies 
the data dependencies of an algorithm. In 
a DG, nodes represent computations and 
arcs specify the data dependencies between
computations. In our notation, with
respect to a dependence arc, the terminating
node depends on the initiating node. In
deriving an array for a given algorithm, we
first derive a localized DG from the algorithm
and then map the DG to a systolic
array or directly to a wavefront array.

Figure 4. Illustration of a linear projection with projection vector d (a); a linear
schedule vector s and its hyperplanes (b).

DG design. From the initial sequential 
description of an algorithm, we can derive 
a DG for that algorithm by first convert¬ 
ing it to a single assignment form, in which 
any variable is assigned a unique value 
once in the algorithm. The single assign¬ 
ment form of an algorithm can show the 
data dependencies in it clearly. For regu¬ 
lar and recursive algorithms, the DGs will 
also be regular and can be represented by 
a grid model; therefore, the nodes can be 
specified by simple indices such as (i, j, k).
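
As an illustration of the single assignment form, the sketch below rewrites matrix multiplication so that each intermediate value is stored under its own index (i, j, k) and is written exactly once. The function name and the dictionary representation are assumptions made for this example; only the recursion itself comes from the matrix multiplication discussed earlier.

```python
def matmul_single_assignment(A, B):
    """Matrix product in which every variable c[(i, j, k)] is assigned once."""
    n = len(A)
    c = {}
    for i in range(n):
        for j in range(n):
            c[(i, j, 0)] = 0  # boundary value c_ij^(0)
            for k in range(1, n + 1):
                # DG node (i, j, k) depends only on node (i, j, k - 1),
                # so the dependence arc (0, 0, 1) is the same at every node.
                assert (i, j, k) not in c  # single assignment check
                c[(i, j, k)] = c[(i, j, k - 1)] + A[i][k - 1] * B[k - 1][j]
    product = [[c[(i, j, n)] for j in range(n)] for i in range(n)]
    return product, c


if __name__ == "__main__":
    product, nodes = matmul_single_assignment([[1, 2], [3, 4]], [[5, 6], [7, 8]])
    print(product)
```

Because each value has a unique index, the keys of the dictionary can be read directly as the nodes of a grid-like DG.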

A mapping methodology is used for 
mapping uniform (that is, shift-invariant) 
DGs onto processor arrays. (A DG is shift- 
invariant if the dependence arcs cor¬ 
responding to all the nodes in the index 
space do not change with respect to the 
node positions.) Matrix multiplication, 
convolution, autoregressive filtering, dis¬ 
crete Fourier transforms, discrete 


Hadamard transforms, Hough trans¬ 
forms, least squares solutions, sorting, 
perspective transforms, LU decomposi¬ 
tion, and QR decomposition all belong to 
this algorithm class. By exploiting the 
regularity of such algorithms, we can 
greatly simplify the array processor design 
for them. 

A straightforward implementation of a 
DG is to assign each node in the DG to a 
PE. This is not efficient, since each PE 
only executes one computation in the algo¬ 
rithm. Therefore, we would like to let each 
PE execute multiple nodes in the DG and 
yet retain all the parallelism in the DG. 
This calls for a mapping from the DG to 
the array processor. In the following, we 
describe mapping DGs to both systolic and 
wavefront arrays. 

Mapping DGs to systolic arrays. In 
mapping a uniform (shift-invariant) DG to 
a systolic array, we need to specify the 
node assignment and the schedule for the 
DG. 

The node assignment specifies how the 
nodes in the DG are assigned to the PEs in 
the array. A linear assignment (projection) 
of a DG is a linear mapping of the nodes 


of the DG to the PEs, in which nodes along 
a straight line are mapped to a PE. The 
projection direction is denoted by a vector 
d (see Figure 4a). 

The schedule specifies the execution 
time for all the nodes in the DG. The 
scheduled execution time of a node is rep¬ 
resented by a time index (that is, by an 
integer). A linear schedule, denoted by s, 
maps a set of parallel equitemporal hyper¬ 
planes to a set of linearly increased time 
indices, where s is the normal vector of the 
equitemporal hyperplanes (see Figure 4b). 
That is, the time index of a node can be
mathematically represented by $s^T i$, where
$i$ denotes the index of the node.

For a systolic array to be obtained, the 
projection vector d and the schedule vec¬ 
tor s have to satisfy two constraints: 

• $s^T e > 0$. Here $e$ denotes any edge in
the DG. The number $s^T e$ denotes the
number of delays (Ds) on the edge of the
systolic array. The schedule vector s must
obey the data dependencies of the DG;
that is, if node i depends on the output of
node j, then j must be scheduled before i.

• $s^T d > 0$. The projection vector d and
the schedule vector s cannot be orthogonal
to each other; otherwise, sequential
processing will result.
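
The two constraints can be checked mechanically. The following sketch assumes a two-dimensional DG, for which the PE index of a node can be taken as its coordinate along the direction perpendicular to d. The helper names and the example vectors are hypothetical and serve only to illustrate the conditions stated above.

```python
def dot(u, v):
    """Inner product of two integer vectors."""
    return sum(a * b for a, b in zip(u, v))


def is_valid_systolic_mapping(s, d, edges):
    """Check s^T e > 0 for every DG edge e, and s^T d > 0."""
    return dot(s, d) > 0 and all(dot(s, e) > 0 for e in edges)


def map_node_2d(node, s, d):
    """Return (PE index, time index) of a node in a 2-D DG.

    Nodes that differ by a multiple of d receive the same PE index
    because the vector p below is perpendicular to d.
    """
    p = (-d[1], d[0])
    return dot(p, node), dot(s, node)


if __name__ == "__main__":
    edges = [(1, 0), (0, 1), (1, 1)]  # assumed dependence arcs
    s, d = (1, 1), (1, 0)             # assumed schedule and projection vectors
    print(is_valid_systolic_mapping(s, d, edges))
    print(map_node_2d((2, 3), s, d))
```

With these assumed vectors, nodes (2, 3) and (4, 3) fall on the same PE but receive different time indices, since they differ by a multiple of d and $s^T d$ is nonzero.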

In systolic mapping, the following rules 
are adopted: 

• The nodes in the systolic array must 
correspond to the projected nodes in the 
DG. 

• The arcs in the systolic array must cor¬ 
respond to the projected components of 
arcs in the DG. 

• The input data must be projected to 
the corresponding arcs in the systolic 
array. 

Figure 5a shows a mapping of a DG to a 
systolic array. The DG in Figure 5, while 
not explicitly specified, represents a 
convolution-like algorithm. 

Mapping DGs to wavefront arrays. Due 
to its dataflow nature, a wavefront array 
does not have a fixed schedule. Therefore, 
the operation of a wavefront array is dic¬ 
tated only by the data dependency struc¬ 
ture and the initial data tokens. A 
wavefront array can be modeled by a 
dataflow graph, or DFG. A DFG is a 
weighted, directed graph

$$\mathrm{DFG} = [N, A, D(a), Q(a), \tau(n)]$$

in which nodes N model computation and
arcs A model communication links. Each
node n has an associated nonnegative real
weight $\tau(n)$ representing its computation
time. Each arc a is associated with a nonnegative
integer weight, D(a), representing
the number of initial data tokens on the
arc, and a positive integer weight, Q(a),
representing the FIFO queue size of the
arc. A node is enabled when all input arcs
contain tokens and all output arcs have
free space in their queues. A node fires after it has
been enabled for its computation time.
Whenever a node fires, one input token is
taken away from each input arc, and each
output arc from the node is assigned one
more token.

Figure 5. Mapping a DG to a systolic array (a); mapping a DG to a DFG
(wavefront array) (b).
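
The enabling and firing rules can be sketched in a few lines. This example is untimed: it ignores the node computation time and only tracks token counts, and it reads the condition on output arcs as requiring free space in each queue. The class and function names are assumptions made for the illustration.

```python
from dataclasses import dataclass


@dataclass
class Arc:
    src: str       # initiating node
    dst: str       # terminating node
    tokens: int    # current token count, initially D(a)
    capacity: int  # FIFO queue size Q(a)


def enabled(node, arcs):
    """True if every input arc holds a token and every output arc has room."""
    inputs_ready = all(a.tokens > 0 for a in arcs if a.dst == node)
    outputs_free = all(a.tokens < a.capacity for a in arcs if a.src == node)
    return inputs_ready and outputs_free


def fire(node, arcs):
    """Consume one token per input arc and produce one per output arc."""
    for a in arcs:
        if a.dst == node:
            a.tokens -= 1
        if a.src == node:
            a.tokens += 1


if __name__ == "__main__":
    # Two nodes in a loop, with one initial token on the arc from q to p.
    arcs = [Arc("p", "q", tokens=0, capacity=1), Arc("q", "p", tokens=1, capacity=1)]
    print(enabled("p", arcs), enabled("q", arcs))
    fire("p", arcs)
    print(enabled("p", arcs), enabled("q", arcs))
```

In this two-node loop the single token alternates between the arcs, so the nodes fire strictly in turn without any global clock.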

There exists a systematic way to map 
DGs to DFGs. Recall that for a shift- 
invariant DG, some of its boundary nodes 
may appear to have a different depen¬ 
dency structure (e.g., fewer dependency 
arcs) than that of the internal nodes. For 
our mapping, it is necessary to enforce a 
uniform appearance by assigning some 
initializing data (usually a constant, e.g., 
zero) to the boundary nodes of the DG. 
After this is done, all the nodes have the 
same dependency arcs (see Figure 5b), and 
all the data input to boundary nodes are 
viewed as input data. 

For the shift-invariant DG and a given 
projection direction d, we can derive the 
DFG in a manner similar to the systolic 
mapping. Each input data token in the DG 
is mapped to an initial token on the cor¬ 
responding arc in the DFG. Here the queue 
size for each DFG arc is assumed to be 
large enough to accommodate the target 
algorithms. An example of mapping a DG 
to a DFG (with its initial tokens) is shown 
in Figure 5b. In contrast to the systolic 
mapping shown in Figure 5a, the DFG 
mapping does not need any schedule vec¬ 
tor s, since the data-driven computing 
nature of the wavefront array obviates the 
need to specify the exact timing. Further¬ 
more, based on the dataflow principle, an 
optimal schedule implied by the DG will be 
automatically followed. This is explained 
below. 

Assume that each DG node is assigned 
to one PE and that all the input data are 
available, so that minimum computation 
time can be achieved. Suppose the projec¬ 
tion direction d is chosen so there is a strict 
dependency among the nodes that are 
mapped to the same PE. Thus, the sequen¬ 
tial processing among these nodes by the 
single PE should not in any way impose an 
extra slowdown in the execution time, and 
hence the resulting DFG can complete the
same computation in minimum time. This


provides a simple guideline for the selec¬ 
tion of the projection direction d. (In fact, 
this rule is also useful for the systolic map¬ 
ping.) Note that this guideline may be 
generalized to cover the nonhomogeneous 
DG and nonlinear assignment situations. 6 
A nonlinear assignment is a good choice if 
the nodes assigned to each PE have a strict 
data dependency. Note that a nonlinear 
assignment can be easily implemented on 
a programmable wavefront array. 


Sometimes the nodes in a DG may have 
different execution times and these times 
may (or may not) depend on the input 
data. For a systolic design, such timing 
uncertainty will prevent the designer from 
seeking an optimal schedule. In this case, 
a wavefront design will be more appealing 
from a speed point of view. Analyzing the 
exact performance of a wavefront array, 
which is data-driven and sometimes data- 
dependent, is very difficult. A method that 
provides an upper bound on the execution
time of wavefront arrays for sparse matrix 
multiplications has been proposed by 
Melhem. 4 

Queues in the DFG. Using queues is a 
way to implement asynchronous commu¬ 
nication in a wavefront array. Actually, a 
queue is a mechanism for storing and 
retrieving data. Queues can be imple¬ 
mented by software or hardware. For high 
processing speed, such as is required by a 
wavefront array, hardware queues are pre¬ 
ferred. However, queues can also be 
implemented with memories by software, 
which has the advantage that queue 
lengths are not limited. 

In the above discussion, we have 
assumed that the queues in a wavefront 
array (or DFG) are large enough for the 
target algorithms. Insufficient queue size 
usually results in an additional slowdown 
of the computation. Therefore, it is natu¬ 
ral to ask how to determine the minimum 
queue size required for each DFG arc so 
the minimum computation time can be 
achieved. For simplicity, the DG node 
computation times are assumed to be data- 
independent. Hence, the DG can be sched¬ 
uled a priori and the minimum computa¬ 
tion time can be determined. Recall that 
each DFG arc represents a number of DG 
arcs. Suppose that a DG arc a is projected 
onto a DFG arc a'. To determine the mini¬ 
mum required queue size for a ', we note 
the following: 

• The scheduled completion time, $t_1$,
for the initiating node of a indicates when
the output data of the node are produced
(or put) on a'.

• The scheduled completion time, $t_2$,
for the terminating node of a indicates
when the data are consumed from a'.
Apparently, if $\tau$ is the node computation
time, then $t_2 - t_1 + \tau$ represents the
length of time a data token stays in a' and
its two end nodes.

• The pipelining period, $\alpha$, which is the
time period between two consecutive data
being put on a', can be determined from
the schedule.

Thus, the queue size for a', Q, can be calculated as

$$Q = \lceil (t_2 - t_1 + \tau)/\alpha \rceil$$

where $\lceil \cdot \rceil$ denotes the ceiling function.
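
As a small worked example of this formula, the sketch below evaluates the minimum queue size for one arc. The function name and the numeric values are assumptions chosen for illustration; they do not come from the article.

```python
import math


def min_queue_size(t1, t2, tau, alpha):
    """Minimum queue size Q = ceil((t2 - t1 + tau) / alpha) for a DFG arc.

    t1: scheduled completion time of the initiating node
    t2: scheduled completion time of the terminating node
    tau: node computation time
    alpha: pipelining period (time between consecutive data put on the arc)
    """
    return math.ceil((t2 - t1 + tau) / alpha)


if __name__ == "__main__":
    # A token produced at t1 = 3 and consumed at t2 = 7, with tau = 1 and
    # a new datum every alpha = 2 time units, occupies the arc for 5 units.
    print(min_queue_size(3, 7, 1, 2))
```

With these assumed values a token stays on the arc for 5 time units while a new one arrives every 2, so three queue positions are needed.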

If the queue size of a wavefront array is 
less than the minimum required one, then 
the overall speed of the array will be 
slowed down. In the general case, when the 


DG is not shift-invariant or the node times 
are different, then the projected DFG is 
not regular. Kung, Lewis, and Lo provide 
a detailed analysis of the timing of the 
DFG and the minimization of queues for 
optimal throughput. 7 Their work is based 
on timed Petri net theories. 

Programming. In this discussion, we
assume that wavefront arrays have fixed
interconnections between PEs and a fixed
queue length for each data link. Programming
a wavefront array means specifying
the sequence of operations for each PE. 
Each operation includes the following 
specifications: 

• the type of computation (addition, 
multiplication, division, and so on), 

• the input data link (north, south, east, 
west, or internal register), and 

• the output data link. 

Note that an additional specification relat¬ 
ing to when an operation actually occurs 
is required when one is programming sys¬ 
tolic arrays with fixed interconnections. 
This is needed to ensure the correct timing 
of the computations in these arrays, but it 
is not required in wavefront array pro¬ 
gramming because of the wavefront 
array’s data-driven nature. In this sense, 
programming a wavefront array is easier 
than programming a systolic array. From 
the viewpoint of algorithm mapping (or 
automatic language translation from a 
program on the host to a program for the 
array), both wavefront and systolic arrays 
require an assignment of the computa¬ 
tional nodes of the DG to the PEs. For sys¬ 
tolic arrays a (time) scheduling of 
computations is also necessary. Since a 
wavefront array has inherent self-timing, 
it needs no scheduling. In fact, it will adopt 
the optimal schedule. 

A programming language for a 
wavefront array should be able to express 
parallel data-driven computing. One good 
example of such a language is Occam, 8 
which is designed to be the programming 
language for the Inmos Transputer and 
which is, essentially, a high-level language. 
Another language that can be used for a 
wavefront array is MDFL, the Matrix 
Data Flow Language, 5 which uses the 
wavefront notion to reduce the complex¬ 
ity of parallel programming. Wavefront 
array programming is quite straightfor¬ 
ward for those algorithms that are already 
expressed in terms of DFGs. For example, 
many 1-D or 2-D digital filters can be ini¬ 
tially given in the DFG form. The pro¬ 
grammer just needs to assign one PE to 


each DFG node, if possible, and write pro¬ 
grams to execute the node functions. If the 
number of PEs is less than the number of 
DFG nodes, the programmer can group 
several DFG nodes and assign them to a 
single PE. 

A matrix multiplication example. We 
can now give an example of Occam pro¬ 
gramming for a two-dimensional 
wavefront array for matrix multiplication. 
The computing network serves as a (data) 
wave propagating medium (see Figure 3 
again) that can be implemented by using 
the “channels” in Occam. 

For data input, one matrix enters the 
array of processors from the left column, 
while the other matrix enters from the top 
row. As the data values move right and 
down they are multiplied and accumu¬ 
lated. Finally, after the entire matrix has 
passed through the array, each processor 
contains an element of the final matrix. 
Again, all the PEs perform the same tasks 
of reading, multiplying, accumulating, 
and transmitting the data further right and 
down. 

An Occam program for the main program,
which specifies the array structure,
is shown below:

CHAN vertical[n*(n+1)]:
CHAN horizontal[n*(n+1)]:
-- each column i owns n+1 vertical channels, hence the (n+1)*i stride
PAR i = [0 FOR n]
  PAR j = [0 FOR n]
    mult (vertical[((n+1)*i)+j],
          vertical[((n+1)*i)+j+1],
          horizontal[(n*i)+j],
          horizontal[(n*(i+1))+j]):

The process mult, which describes the PE
operations, is called n x n times and is
shown below:

PROC mult (CHAN up, down, left, right)
  VAR acc, a, b :
  SEQ
    acc := 0
    SEQ i = [0 FOR n]
      SEQ
        -- read both operands, accumulate, then pass them on
        PAR
          up ? a
          left ? b
        acc := acc + (a*b)
        PAR
          down ! a
          right ! b :


More programming examples are 
presented in Kung et al. 5 and Kung. 6 
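
For readers without access to an Occam system, the same data-driven structure can be approximated in Python. The sketch below is an analogy rather than a translation: it assumes one thread per PE and uses bounded `queue.Queue` objects in place of Occam channels, so a PE blocks until its operands arrive or until its neighbor has room. The function name and the `queue_size` parameter are hypothetical, and the two reads are performed in sequence rather than in parallel.

```python
import queue
import threading


def wavefront_matmul(A, B, queue_size=1):
    """Multiply n x n matrices on a simulated n x n wavefront array."""
    n = len(A)
    # vertical[i][j] enters PE (i, j) from above; horizontal[i][j] from the left.
    # Index n is an unbounded sink (maxsize 0) for data leaving the array.
    vertical = [[queue.Queue(queue_size if i < n else 0) for _ in range(n)]
                for i in range(n + 1)]
    horizontal = [[queue.Queue(queue_size if j < n else 0) for j in range(n + 1)]
                  for _ in range(n)]
    C = [[0] * n for _ in range(n)]

    def pe(i, j):
        acc = 0
        for _ in range(n):
            b = vertical[i][j].get()      # blocks until data arrive from above
            a = horizontal[i][j].get()    # blocks until data arrive from the left
            acc += a * b
            vertical[i + 1][j].put(b)     # pass the operand down
            horizontal[i][j + 1].put(a)   # pass the operand right
        C[i][j] = acc

    def feed(channel, values):
        for v in values:
            channel.put(v)

    threads = [threading.Thread(target=pe, args=(i, j))
               for i in range(n) for j in range(n)]
    for i in range(n):   # row i of A enters from the left
        threads.append(threading.Thread(target=feed, args=(horizontal[i][0], A[i])))
    for j in range(n):   # column j of B enters from the top
        column = [B[k][j] for k in range(n)]
        threads.append(threading.Thread(target=feed, args=(vertical[0][j], column)))
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return C


if __name__ == "__main__":
    print(wavefront_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]]))
```

No schedule is specified anywhere in this sketch; the order of operations follows from data availability alone, which is the property that distinguishes wavefront programming from systolic programming.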
System architecture
design

The overall architecture of an array 
processor system can be divided into a 
hierarchy of levels: 

• The system architecture level defines 
how the system appears to users and appli¬ 
cation programmers. It includes the 
characteristics of languages, the user inter¬ 
face, and the operating system. 

• The array architecture level defines 
the interconnections between different 
arrays and the functional capabilities of 
the processors comprising the arrays. 

• The PE architecture level defines the 
hardware modules for the PE nodes. It 
includes both the instruction implementa¬ 
tion (fetch, decode, or execute mechanism) 
and the interfaces between individual 
building blocks (e.g., bus widths, queue 
lengths). 

The model of execution should be defined
at every architectural level. Not all levels
necessarily share the same model of
execution. For
example, we may have a dataflow model 
at the system level, allowing the user to see 
the system as a collection of processes and 
processors that operate on data availabil¬ 
ity, whereas we may have a control flow 
model at the underlying levels (e.g., the 
PEs may be conventional von Neumann 
machines). Or the system level may have 
a control flow model of execution (e.g., be 
programmed in Fortran) and the underly¬ 
ing levels may adopt a dataflow model of 
computation (e.g., the PEs may be NEC
µPD7281 dataflow chips). In the
wavefront array system proposed below, 
we adopt the dataflow model for the array 
architecture level and control flow model 
for the PE architecture level. 

A wavefront array processor may be 
used either as an attached processor inter¬ 
facing with a compatible host machine or 
as a stand-alone processor equipped with 
a global control processor. A system con¬ 
figuration for an array processor is pro¬ 
posed in Figure 6. The system consists of 
the following major components: 



Figure 6. An array processor system consists of a host, an array control unit, an 
interface system, an interconnection network, and a processor array. 


• processor array(s), 

• interconnection network(s), and 

• a host computer and interface unit. 

In general, desirable features for an 
array processor system are high speed, 
flexibility, reliability, and cost- 
effectiveness. Let us now explore the key 


architecture design issues for each of the 
above components. 

Processor arrays. A processor array 
comprises a number of PEs linked by a 
network with a regular topology. For fixed 


array structures, the 1-D (linear), 2-D 
(mesh or hexagonal), and 3-D cube- 
connected networks are the most popular. 
Many algorithms can indeed be mapped to
these regular arrays. Other variants such
as hypercubes 9 and shuffles 10 are also
becoming popular. The choice of array
structure depends on the communication
required by the given algorithms and applications.

Figure 7. The proposed handshaking circuit, with glitch protection ability (a); the
timing diagram of this circuit (b).

DR = Data ready
DA = Data available
DP = Data processed
DS = Data sent
DU = Data used
POR = Power on reset

Dynamically interconnected or recon- 
figurable array structures allow an array 
to support a large class of algorithms. Such 
structures usually involve significant hard¬ 
ware overhead. However, various strate¬ 
gies for improving the speed and flexibility 
of (global) interconnection networks are 
available. Reconfigurability of array struc¬ 
tures based on switching lattices has been 
proven to be useful for solving problems 
related to fault tolerance. A typical exam¬ 
ple is the CHiP project, 11 which uses a 


programmable switch lattice embedded in 
the PE array. 

PE architecture. The PE architecture 
should meet the requirements of the 
intended computing tasks. The key 
requirements for DSP applications, for 
example, are adequate word length, a fast 
multiply and accumulate, high-speed 
RAM, and fast coefficient table address¬ 
ing. The functionality of the PE should be 
designed to support these needs. There are 
four main components in the PE: 

• Arithmetic and logic unit. Since high 
throughput is usually demanded of a 
wavefront array processor, the ALU must 


rapidly compute frequently encountered 
operations. Fixed-point ALUs are cheaper 
to build, but floating-point ones provide 
higher precision and dynamic range, which 
are often required by DSP applications. 

• Memory unit. Designs with separate, 
on-chip program and data memories are 
now becoming popular among digital sig¬ 
nal processors. Although on-chip memo¬ 
ries are smaller than off-chip ones, they do 
allow faster processing. 

• Control unit. There are two 
approaches to control unit design. The 
first is the reduced instruction set com¬ 
puter (RISC) approach, which uses a small 
set of simple instructions and obtains a 
simple control unit with a higher clock
rate. The second is the complex instruction 
set computer (CISC) approach, which uses 
a large and complex instruction set and 
allows complicated tasks to be completed 
with fewer instructions. The current trend 
in VLSI implementations appears to be 
toward the RISC approach. 

• I/O unit. The PE should be able to 
perform data transfers concurrently with 
processing. For each of the communica¬ 
tion links, the transfer of data is controlled 
by a separate I/O controller, which han¬ 
dles the two-way handshaking functions. 

Asynchronous communication protocols.
One of the most important features
that distinguish a wavefront array proces¬ 
sor from other array processors is the data- 
driven operation of each of its PEs. To 
ensure correct sequencing and data trans¬ 
fers between adjacent PEs, handshaking 
protocols must be adopted to synchronize 
the operations. There are two types of 
asynchronous communication schemes: 
the one-way control scheme and the two- 
way control scheme. In the one-way con¬ 
trol scheme, the sender sends data without 
waiting for the acknowledgment signal of 
the receiver. This method is suitable for a 
wavefront array processor only when large 
buffers are provided. The two-way control
scheme, usually known as handshaking, is
preferable for most wavefront array 
processors. A proposed handshaking cir¬ 
cuit is shown in Figure 7a. This circuit can 
be considered an improved version of a 
previous design. 5 This new design is more 
robust because its flip-flops are driven by 
internal clocks, and it is less sensitive to the 
glitch noise encountered in the communi¬ 
cation links. The timing diagram of this 
circuit is shown in Figure 7b. Two rising- 
edge-triggered JK flip-flops and two 
falling-edge-triggered D flip-flops plus two 
AND gates are used to implement the cir¬ 
cuit. The basic operations and protocols 
during one handshaking cycle are as 
follows: 

(1) When DE = 1 and CK1 = falling,
then DR = 1 (for one clock cycle) and
data are on the bus.

(2) When DR = 1, DU1 = 1, and
CK1 = rising, then DS1 = 1 and
DE = 0.

(3) When DS1 = 1, Q* = 1, and
CK2 = falling, then DS2 = 1.

(4) When DS2 = 1, DP = 0, and
CK2 = rising, then DA = 1 and
DU2 = 0, and data are latched on PE2.

(5) When DA = 1 and CK2 = falling,
then DP = 1 (for one clock cycle) and
data are used.

(6) When DS2 = 1, DP = 1, and
CK2 = rising, then DA = 0 and
DU2 = 1.

(7) When DU2 = 1 and CK1 = falling,
then DU1 = 1.

(8) When DU1 = 1, DR = 0, and
CK1 = rising, then DS1 = 0 and
DE = 1.


If we define the handshaking communi¬ 
cation overhead to be equal to the time 
interval between the rising edge of the DR 
flag and that of the DP flag, this time inter¬ 
val is determined by several delay factors: 
the flip-flop time delay d_ff, the propagation
time delay d_pr, the phase difference
between CK1 and CK2, and the delays 
introduced by the DP flag response after 
DA is set high. On average, this overhead 
is shorter than three clock periods, and in 
most cases both PEs continue their com¬ 
putations during the handshaking time 
interval. When the granularity of PEs is 
large (which is true for wavefront array 
processors), the handshaking time penalty 
is less significant. 
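
The behavior that the two-way scheme guarantees can be illustrated with a small software model. This is only a sketch under simplifying assumptions: it does not reproduce the flip-flop circuit of Figure 7, the names HandshakeLink and simulate_link are hypothetical, and each PE clock is reduced to an integer period in arbitrary ticks. It shows that no word is lost or duplicated even when the two clocks are unrelated.

```python
class HandshakeLink:
    """One-word link between two PEs with a request/acknowledge handshake."""

    def __init__(self):
        self.full = False   # request flag: a word is waiting on the link
        self.word = None

    def try_send(self, word):
        # The sender may drive the link only after the previous word
        # has been acknowledged by the receiver.
        if self.full:
            return False
        self.word, self.full = word, True
        return True

    def try_receive(self):
        # The receiver latches the word and acknowledges, freeing the link.
        if not self.full:
            return None
        self.full = False
        return self.word


def simulate_link(words, sender_period, receiver_period, ticks):
    """Drive the link from two unrelated clocks; return the words received."""
    link, pending, received = HandshakeLink(), list(words), []
    for t in range(ticks):
        if pending and t % sender_period == 0 and link.try_send(pending[0]):
            pending.pop(0)
        if t % receiver_period == 0:
            word = link.try_receive()
            if word is not None:
                received.append(word)
    return received
```

Whether the sender is faster than the receiver or slower, the received sequence equals the sequence sent; the faster side simply waits.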

Figure 8. Three configurations for the interconnection network in an array processor system: intra-array communication
among PEs (a); intra-array communication between PEs and memory blocks (b); inter-array communication between
arrays and global memory blocks (array control units not shown) (c).

Block handshaking scheme. One way to
reduce the handshaking overhead is to use
a block handshaking scheme, in which a
block of data can be transmitted and
received in only one handshaking opera¬ 
tion. This method is useful for communi¬ 
cation between systems operating with 
almost identical clock frequencies. The 
success of the block handshaking scheme 
relies on the assumption that the clock fre¬ 
quency and phase of the sending PE 
remain stable during the period of the 
block data transfer. 
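
The saving from block handshaking is simple amortization, which the following sketch makes explicit. The function name and the cost model (a fixed number of cycles per word plus one handshake per block) are assumptions made for illustration; the three-cycle handshake used in the usage note is taken from the average overhead quoted earlier.

```python
def cycles_per_word(transfer_cycles, handshake_cycles, block_size):
    """Average link cycles per word when one handshake covers a whole block."""
    return transfer_cycles + handshake_cycles / block_size
```

With one cycle per word and a three-cycle handshake, a block size of 1 costs 4 cycles per word, while a block size of 16 costs about 1.19 cycles per word.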

Two-level pipelining. In many cases fur¬ 
ther speed enhancement can be obtained 
by using pipelining in the ALU of the PE. 
If this is done, the handshaking scheme of 
participating PEs is the same as before. 
However, due to the uncertainty of a con¬ 
tinuous supply of data into the pipe, the 
PE should record the time at which data 
enter the pipe so it can retrieve the 
processed data from the pipe later. If there 
are no data coming into the pipe, then the 
corresponding output is considered gar¬ 
bage. A simple way to implement this 
scheme is to employ a one-bit shift regis¬ 
ter of the same length as the pipeline in the 
PE. When valid (or invalid) data come into 
the pipe, a 1 (or a 0) is entered into the shift 
register. This 1 (or 0) is shifted in the shift 
register as the data move down the pipe. 
When the data come out of the pipe, the 
1 (or 0) is shifted out of the register and 
used to gate the output data. In this way, 
the PE can retrieve the valid processed 
data. 
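
The valid-bit shift register described above can be sketched as follows. The class name GatedPipeline, the JUNK placeholder, and the use of None to mean "no data this cycle" are assumptions of this sketch; a hardware pipe would simply carry whatever bits happen to be present.

```python
from collections import deque


class GatedPipeline:
    """ALU pipeline paired with a one-bit shift register of the same length."""

    JUNK = -1  # stands in for the garbage a real pipe would carry

    def __init__(self, depth, op):
        self.op = op
        self.data = deque([self.JUNK] * depth, maxlen=depth)  # pipeline stages
        self.valid = deque([0] * depth, maxlen=depth)         # 1 = valid data

    def clock(self, operand=None):
        """Advance one cycle; pass None when no data arrive."""
        out_data, out_valid = self.data[-1], self.valid[-1]
        is_valid = operand is not None
        # appendleft on a full deque drops the oldest (rightmost) entry,
        # so both registers shift by one position per clock.
        self.data.appendleft(self.op(operand) if is_valid else self.JUNK)
        self.valid.appendleft(1 if is_valid else 0)
        # The bit shifted out of the register gates the output data.
        return out_data if out_valid else None
```

For a three-stage pipe that squares its input, feeding 2, 5, nothing, 7 produces 4, 25, nothing, 49 three clocks later; the gap in the input stream reappears as a gated-off output rather than as garbage.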

Interconnection networks. The inter¬ 
connection networks used in processor 
arrays can significantly affect their speed. 
Certain structured (intra-array) interconnection
networks can be incorporated into
a PE array to provide direct, global, and
high-speed communications. There are 
two suitable ways to configure an intra¬ 
array interconnection network. 

The first configuration (Figure 8a), in 
which a network is used to support direct 
communication among PEs, is appropri¬ 
ate when the number of PEs is equal to the 
number of memory blocks. Each PE is 
permanently linked to one and only one 
memory block, and some smart memory 
management must reside in the memory 
blocks to reformat the I/O data. Commu¬ 
nication between PEs is established 
through the interconnection network, 
which can realize various interconnection 
patterns. When several computations in 
different PEs are successively performed 
using the same set of data, this structure 
is very efficient since the memory opera¬ 
tions are usually slower than the functions 
of the PE and are not involved in the trans¬ 
fer of data between different PEs. This 
design is the one that has been chosen for 
most array computers currently in use or 
under development. 12 

The second configuration is often used 
when the numbers of memory blocks and 
PEs are not equal. Usually, the number of 
memory blocks is greater than the number 
of PEs (e.g., in matrix manipulation 
algorithms). Figure 8b illustrates this con¬ 
figuration, in which the interconnection 
network is used to connect one or more 
memory blocks to one or more PEs. 

When there are multiple arrays (or mul¬ 
tiple clusters of PEs) in an array processor 
system, a bottleneck is often created if the 
host machine is asked to handle all the data 
transactions between the arrays. Therefore,
for inter-array communication, a
global interconnection network is added
between the global memory blocks and the 
local memory blocks, while for intra-array 
communication, a local interconnection 
network is provided within each array or 
cluster (Figure 8c). 

The configuration shown in Figure 8c, 
a hierarchical version of that shown in Fig¬ 
ure 8b, illustrates this arrangement. In 
each subarray (cluster), a local intercon¬ 
nection network, which links the faster 
local memory blocks and the PE array, 
provides intra-array communication, 
while in the system as a whole, a global 
interconnection network, which talks 
through the local control unit to commu¬ 
nicate with local memory, provides inter¬ 
array communication and also allows 
access to globally shared data. 13 For 
applications in which inter-array commu¬ 
nication is frequent, the local memory and 
local interconnection networks may be 
removed. This allows the global intercon¬ 
nection network to be used for both intra- 
and inter-array communication. However, 
the global interconnection network must 
then provide a complete partition capabil¬ 
ity to allow it to be used as either a local 
(intra-array) or a global (inter-array) 
network. 14 


Host computer. The host computer (or 
the array control unit) 

• provides system monitoring, batch 
data storage, management, and data 
formatting, 

• determines the schedule program that 
controls the interface unit and inter¬ 
connection network, and 
• generates global control codes and
object codes for PEs. 

The host can also perform data formatting
or conversion between floating-point
and fixed-point notations, bit-serial
and bit-parallel representations, and so on.
Direct memory access schemes are often 
used for fast data transfer between the 
array and the host or the I/O units. The 
host determines and schedules the paral¬ 
lel processing tasks and matches them with 
the available array processor modules. It 
generates control codes to coordinate all 
system units. The array control unit fol¬ 
lows the schedule commands, performs 
data rearrangement, and handles direct- 
data-transfer traffic. Sequence control 
guides the sequencing of operations. Stor¬ 
age management may be specified 
either by the programmer or by the system 
controller. System-controlled manage¬ 
ment is often safe and free from undesira¬ 
ble interference, but its efficiency of 
allocation is low. A stack-based (runtime) 
storage management scheme is a simple 
and useful alternative. 
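
A stack-based scheme of the kind mentioned above might look like the following sketch. The class StackStore and its methods are hypothetical; the point is only that allocation and release are both constant-time pointer adjustments, and that storage is released in the reverse order of allocation.

```python
class StackStore:
    """Runtime storage manager that hands out frames from a fixed-size store."""

    def __init__(self, size):
        self.size = size
        self.top = 0        # first free word
        self.frames = []    # base addresses of live frames

    def push_frame(self, words):
        """Reserve `words` words and return the base address of the frame."""
        if self.top + words > self.size:
            raise MemoryError("store exhausted")
        base = self.top
        self.frames.append(base)
        self.top += words
        return base

    def pop_frame(self):
        """Release the most recently allocated frame."""
        self.top = self.frames.pop()
```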

Operating system. Most conventional 
operating systems support features 
designed to provide a complete program¬ 
ming environment to the user. These fea¬ 
tures remain desirable in array processors. 
They include disk and terminal I/O support,
resource management facilities (for
the CPU, arrays, memory, and I/O 
devices), multiuser and multitasking capa¬ 
bilities, and virtual memory management. 

An operating system for an array 
processor involves one major extension to 
a conventional operating system: a driver
for the arrays, which must be able to treat
the arrays both as a whole (to allocate tasks 
to them) and as collections of processors 
(to load programs and data to individual 
PEs). The information the driver needs to 
perform these tasks is provided by a lan¬ 
guage compiler that performs dependence 
analysis. 

In the environment of a special-purpose 
system, several software tools are needed: 

• program development tools—that is, 
basic tools (editors and file managers) 
like those provided by a conventional 
operating system and additional tools 
(high-level languages and projection 
programs) to assist in mapping and 
matching algorithms to arrays, 

• a testing and debugging tool—that is, 
a target (array) architecture sim¬ 
ulator, 

• code downloading tools—that is, 
tools to program ROMs for existing 
systems, and interfaces to drive CAD 
systems for custom/semicustom 
implementation, and 

• runtime support tools—that is, a 
library of routines to support inter- 
PE and host-PE communications, 
basic control functions, and the man¬ 
agement of global resources such as 
memory and external I/O. 

Interface unit. The interface unit, which 
is connected to the host via the host bus or 
through DMA, downloads, uploads, and 
buffers array data, handles interrupts, and 
formats data. Since an array processor is 
used as an attached processor, the design 
of this unit is important. The interface unit 
is monitored by the host or array control 
unit according to the schedule program. 

Why wavefront 
architectures? 

Both wavefront and systolic arrays 
share the important common feature of 
using a large number of modular and 
locally interconnected processors for mas¬ 
sively pipelined and parallel processing. 
They are, however, different in hardware 
design—most specifically in their clock 
and buffer arrangements—and in pro¬ 
gramming requirements. 

Clock distribution. The clocking 
scheme is a critical factor in large-scale 
array systems. If clocking can be suitably 
handled, then synchronous systems 
usually yield a conceptually simpler 
design. However, when a fast system clock
is used, global synchronization often
imposes a heavy burden on the hardware 
because of clock skew. 15 The clock skew 
increases with the size of the array and is 
especially troublesome in two- or higher¬ 
dimensional arrays. Furthermore, syn¬ 
chronization of the transfer of data among 
a large number of PEs may lead to large 
current surges as the components are 
simultaneously energized or change state. 
This problem can be alleviated in 
wavefront arrays, however, because of 
their asynchronous nature. 

Processing speed. Wavefront arrays 
suffer from a fixed time-delay overhead 
resulting from handshaking, although a 
block handshaking scheme can be adopted 
to reduce it. Synchronous arrays, though 
they do not have this problem, can suffer 
a loss of speed—when the processing times 
of the PEs are not uniform, a synchronous 
array may have to accommodate the 
slowest PE by using a slower clock. In con¬ 
trast, wavefront arrays, because of their 
data-driven nature, do not have to hold 
back faster PEs in order to accommodate 
slower ones. Wavefront arrays also yield 
higher speed when the computing times are 
data-dependent. For example, when an 
abundance of “zero” entries are encoun¬ 
tered in sparse matrix multiplications, a 
“trivial” multiplication can be computed 
in much less time than a “nontrivial” mul¬ 
tiplication. Finally, wavefront arrays ben¬ 
efit from various techniques that may be 
adopted to speed up processing time. 16 A 
simulation of systolic and wavefront 
processing for a least squares minimiza¬ 
tion problem showed that the wavefront 
array was almost twice as fast as the sys¬ 
tolic array. 17 
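
The trade-off between handshaking overhead and data-dependent computing time can be explored with a toy timing model. The sketch below is not the simulation cited above; it assumes a linear array in which every item visits every PE in order, takes the clock period of the synchronous array to be the longest operation time, and charges a fixed handshake cost per operation in the wavefront array. The function names are hypothetical.

```python
def systolic_time(d):
    """Synchronous array: every step lasts one clock period,
    and the period is set by the slowest operation."""
    items, pes = len(d), len(d[0])
    period = max(max(row) for row in d)
    return (items + pes - 1) * period


def wavefront_time(d, handshake=0):
    """Data-driven array: a PE starts item k as soon as the item has
    arrived from the previous PE and the PE has finished item k-1."""
    items, pes = len(d), len(d[0])
    finish = [[0] * pes for _ in range(items)]
    for k in range(items):
        for i in range(pes):
            pe_free = finish[k - 1][i] if k else 0
            data_in = finish[k][i - 1] if i else 0
            finish[k][i] = max(pe_free, data_in) + d[k][i] + handshake
    return finish[-1][-1]
```

Here d[k][i] is the time PE i spends on item k. With mostly "trivial" operations, for example d = [[4, 1, 1], [1, 4, 1], [1, 1, 4], [1, 1, 1]], the synchronous model needs 24 time units while the wavefront model needs 21 even with a handshake cost of 1. When every operation takes 4 units, the same handshake cost makes the wavefront model slower (30 versus 24), which is the fixed overhead mentioned above.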

Programming. The programming of 
wavefront arrays is easier than that of sys¬ 
tolic arrays because wavefront arrays 
require only the assignment of computa¬ 
tions to PEs, whereas systolic arrays 
require both assignment and scheduling of 
computations. 

Fault tolerance. Fault tolerance is an 
important concern in systolic and 
wavefront arrays because of the large 
number of PEs they may have. Fault toler¬ 
ance techniques for arrays can be grouped 
into three categories: 

• fabrication-time fault tolerance, 

• compile-time fault tolerance, and 

• runtime fault tolerance. 

Figure 9. Schematic diagram of the PE design for the STC-RSRE wavefront array (reproduced from Davie, Higgins, and
Cawthorn 16).

Neither systolic nor wavefront arrays possess
any special advantages in fabrication-time
and compile-time fault tolerance.
However, wavefront arrays are superior to 
systolic arrays in runtime fault tolerance. 
Since a wavefront array is a data-driven 
computing network, an individual PE can 
isolate itself from the array by indicating 
to adjacent processors that it is not ready 
to accept data. Hence, when a fault occurs 
in a PE, the PE can isolate itself and per¬ 
form self-testing. If the self-testing indi¬ 
cates that the fault has passed (was 
transient), the PE rejoins the array. If the 
self-testing indicates that the fault is per¬ 
manent, the array is reconfigured to cir¬ 
cumvent the faulty PE. In contrast, when 
a fault occurs in a PE in a systolic array, 
all the PEs in the array must be interrupted 
for self-testing. The occurrence of multi¬ 
ple faults is an additional problem for sys¬ 
tolic arrays, since such faults can cause a 
control-interrupt-line contention problem. 
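
The runtime behavior of a wavefront PE under a fault can be summarized as a three-state machine. The sketch below is an illustration only: the state names, the PE class, and the active_path helper are assumptions, and the self-test itself is reduced to a Boolean result supplied by the caller.

```python
from enum import Enum


class State(Enum):
    ACTIVE = "active"        # taking part in the handshake
    ISOLATED = "isolated"    # refusing data while self-testing
    BYPASSED = "bypassed"    # permanently faulty; array routes around it


class PE:
    def __init__(self):
        self.state = State.ACTIVE

    def ready(self):
        """What the PE signals to its neighbors during handshaking."""
        return self.state is State.ACTIVE

    def on_fault(self):
        # The PE isolates itself simply by no longer signaling readiness.
        self.state = State.ISOLATED

    def finish_self_test(self, passed):
        # Transient fault: rejoin. Permanent fault: ask to be bypassed.
        self.state = State.ACTIVE if passed else State.BYPASSED


def active_path(pes):
    """Indices of the PEs that remain in the array after reconfiguration."""
    return [i for i, pe in enumerate(pes) if pe.state is not State.BYPASSED]
```

Only the neighbors of the isolated PE are affected, and only because their handshakes stall; no global interrupt is involved.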

Wafer-scale integration. WSI technol¬ 
ogy can offer significant performance 
enhancements over the conventional 
method of system building, in which 
individually packaged chips are mounted 
on a printed-circuit board. WSI’s advan¬ 
tages include 

• shorter interconnect distances 
between chips, 

• the ability to mix semiconductor tech¬ 
nologies, 


• faster system clock rates, and 

• the ability to perform dynamic inter¬ 
connection by means of switching 
lattices. 

Wavefront architectures’ local intercon¬ 
nections and data-driven nature make 
WSI particularly attractive for use with 
them. With today’s technologies, the 
development of WSI wavefront array 
processors is a realistic goal. The regularity,
modularity, and data-driven nature of
wavefront architectures should make it
possible to devise suitable fault tolerance
techniques for WSI wavefront arrays.
The combination of wavefront techniques 
and WSI technology appears very promis¬ 
ing for future high-speed supercomputing. 

Examples of wavefront 
array systems 

Here, we will examine two wavefront 
array systems that have actually been con¬ 
structed. We will also explore the possibil¬ 
ity of utilizing commercially available 
VLSI microprocessors such as the NEC
µPD7281 dataflow chip and the Inmos
Transputer for constructing wavefront 
arrays. 

STC-RSRE wavefront array processor 
system. The Standard Telecommunications
Company and the Royal Signals and
Radar Establishment in Britain have
jointly developed a wavefront array 
processor system for adaptive beamform¬ 
ing. The system is reconfigurable for many 
distributed array processing appli¬ 
cations. 16 

PE design. The system’s PE is based on 
the TMS32010, with additional hardware 
to provide multiple I/O ports, a floating¬ 
point ALU, lookup tables, localized con¬ 
trol, and a bit-serial diagnostic link. The 
PE includes program ROM, containing 
fixed algorithmic code, and program 
RAM, providing space in which to down¬ 
load programs for other algorithms. The 
schematic diagram of the PE is shown in 
Figure 9. 

Array. When used for adaptive beam¬ 
forming, the STC-RSRE system consists 
of 33 identical PEs, 21 of which are 
organized as a triangular wavefront array, 
which performs the adaptive beamform¬ 
ing function by means of the QR algo¬ 
rithm, and 12 of which perform data 
correction and other preprocessing func¬ 
tions (Figure 10). The system has been suc¬ 
cessfully applied to many real-time 
experiments. It is supported by an RF/IF 
(radio frequency/intermediate frequency) 
front-end subsystem, a zero-IF receiver, 
and A/D converters. The processor system 
is housed in a four-foot-high equipment
rack with power supplies and fans; its total
power consumption is approximately 500

Figure 10. Overall configuration of the STC-RSRE adaptive array system (reproduced from Davie, Higgins, and Cawthorn 16).

A VLSI node processor. Any future 
realization of wavefront arrays for real¬ 
time signal processing requires the devel¬ 
opment of specialized VLSI processors to 
enable more compact implementation of 
systems and provide a throughput rate 
matched to modern radar, communica¬ 
tions, and electronic countermeasure sys¬ 
tems. As a part of the British Ministry of 
Defence’s Very High Performance Inte¬ 
grated Circuit Program, recent work at 
STC has been directed toward the defini¬ 
tion of a high-performance VLSI “node 
chip” that can serve as a programmable 
building block for real-time wavefront 
array subsystems. 16 

A number of key features have been 
included in the node chip specification: 

• a high-throughput processor, 

• on-board I/O control, program and 
data memory, arithmetic units, and 
parallel multiplier blocks, 

• full handshake control to allow sim¬ 
ple interconnection with neighboring 
node chips, 

• an internal architecture optimized for 
floating-point arithmetic, and 

• programmability. 

The node chip overlaps computation
and dataflow by using FIFO buffering on
both its input and output ports. By allow¬ 
ing computation and dataflow to take 
place concurrently, the node chip ensures 
maximum dataflow. The application of 
sophisticated algorithms to real-time sig¬ 
nal processing problems often requires 
that processing parameters be varied in 
accordance with the required system per¬ 
formance or with environmental condi¬ 
tions (for example, that the dynamic 
bandwidth of a digital filter be controlled 
by rapid adjustment of the filter weighting 
coefficients). In addition to the usual data- 
I/O ports, the node chip provides ports for 
receiving control information from an 
external supervisory processor. The control
data are propagated in synchronism
with the data wavefronts passing through 
the array. 

Memory-linked Wavefront Array 
Processor. The MWAP is a new wavefront 
array processor developed at the Applied 
Physics Laboratory of the Johns Hopkins 
University. 18 As a result of its use of 
VHSICs (very high speed ICs), the 
MWAP achieves very high performance. 
Multiple MWAPs are connected on a ring 
network to form a large processing system. 
The MWAP not only serves as a module 
on a ring bus but also uses a modular struc¬ 
ture for its PEs. The basic MWAP archi¬ 
tecture (Figure 11.) consists of a bus 


interface to a dual-port memory, multiple 
PE/dual-port-memory pairs, and a bus 
output interface. 

Each MWAP PE consists of a control 
unit, an instruction unit, an instruction 
cache, a block of memory address 
registers, a floating-point multiplier, and 
a floating-point ALU. Once the instruc¬ 
tion cache is loaded, program and data 
memory are separate. The key to this 
architecture is the memory addressing 
structure. All memory addressing is done 
by reference to a memory address register, 
which can be read, or read and then 
incremented, or read and then reset to a 
base address set up during program load¬ 
ing. The PE can simultaneously read or 
write data memory in both directions. Syn¬ 
chronization of the PEs in the MWAP is 
done by the right and left control flags in 
the memory addressing unit. A PE can 
always read the memory to either its right 
or left. However, it can only write into 
these memories when the control flag for 
them is reset. Each PE can set or reset the 
memory control flag in either memory. If 
the PE attempts a store to a memory that 
has its control flag set, program execution 
is suspended until the control flag is reset; 
then the store command is executed. In 
addition, the setting of a control flag in a 
memory causes an idle PE to begin pro¬ 
gram execution. This memory control 
structure results in high-speed, compact
programs with complete MWAP dataflow
control.

Figure 11. Basic architecture of the MWAP (reproduced from Dolecek 18).
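
The control-flag mechanism of the MWAP can be mimicked in a few lines. The sketch below assumes one particular usage convention that the text does not spell out: the PE on the left stores a value and then sets the flag, and the PE on the right reads the value and then resets the flag. The names LinkMemory and run_mwap_link, and the consumer_every parameter that slows the right-hand PE, are hypothetical.

```python
class LinkMemory:
    """Dual-port memory shared by the PE on its left and the PE on its right."""

    def __init__(self):
        self.cells = {}
        self.flag = False   # control flag: set means "not yet consumed"

    def read(self, addr):
        # Reads are always allowed.
        return self.cells.get(addr)

    def store(self, addr, value):
        # A store to a memory whose flag is set suspends the storing PE.
        if self.flag:
            return False
        self.cells[addr] = value
        return True


def run_mwap_link(values, consumer_every=2):
    """Pass values through one memory; return (values consumed, producer stalls)."""
    mem, pending, consumed, stalls = LinkMemory(), list(values), [], 0
    step = 0
    while pending or mem.flag:
        if pending:
            if mem.store(0, pending[0]):
                pending.pop(0)
                mem.flag = True     # setting the flag wakes the idle consumer
            else:
                stalls += 1         # suspended until the flag is reset
        if mem.flag and step % consumer_every == 0:
            consumed.append(mem.read(0))
            mem.flag = False        # the consumer releases the memory
        step += 1
    return consumed, stalls
```

No explicit handshake instructions appear in either "program"; the flag test inside the store is the whole synchronization mechanism.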

Parallel 24-bit floating-point arithmetic 
was chosen for the prototype MWAP to 
allow a broad range of algorithms to be 
implemented. The dynamic range of 
MWAP floating-point numbers is
1.4 × 10^-39 to 1.7 × 10^38, which simplifies
data scaling. The MWAP floating-point
multiplier is capable of 10 million
multiplications per second. And the TI 
VHSIC static RAM chip used in the cur¬ 
rent MWAP dual-port memory supports 
about 10 million accesses per second per 
PE. Thus, for real-number filtering oper¬ 
ations, memory speed and multiplication 
speed are matched. 

Some possible wavefront architectures. 

A wavefront array processor can be con¬ 
sidered a static dataflow machine and thus 
can be implemented with dataflow PEs. 
Two commercially available VLSI chips, 
the NEC µPD7281 dataflow chip 19 and
the Inmos Transputer, can be considered 
for this. Both provide a set of powerful 
processing primitives and a handshaking 
mechanism for communications. 

Constructing a wavefront array with 
NEC chips. NEC’s µPD7281 is a programmable
device that features a dataflow
architecture. It can help improve the trade-off
between speed and flexibility. The
µPD7281 uses an internal circular pipeline
and a powerful instruction set to allow
high-end image processing. The disadvan¬ 
tage of this chip is that it was initially 
designed for linear arrays or rings—not for 
the wavefront array’s mesh-connected, 
Illiac-IV-type architecture—and thus has 
only one input and one output bus. This 
problem can be solved by designing an 
interface that permits connection of a 
µPD7281 PE to its four neighbors.

Constructing a wavefront array with 
Inmos Transputers. Inmos’s Transputer 
chip (T414 or T424) is an Occam- 
language-based design that provides hard¬ 
ware support for both concurrent compu¬ 
tation and communication. The chip is a 
complete computer with four neighbor 
connections. It has been designed so that 
its external behavior corresponds to the 
formal model of an Occam process. It 
adopts the now-popular RISC architec¬ 
ture. It has a 32-bit processor capable of 
10 MIPS, 4K bytes of 50-ns static RAM 
and, significantly, a variety of communi¬ 
cations interfaces. These features make it 
a powerful building block for constructing 
concurrent processing networks. The 
Transputer’s links are the hardware repre¬ 
sentation of the channels for process com¬ 
munication. There is an intimate 
relationship between Transputer channel 
links and the communication protocol 
envisaged for wavefront arrays. Transputer
nets programmed in Occam may be
quite naturally regarded as wavefront
processors. However, the Transputer is 
significantly limited in many signal 
processing applications. A very useful 
enhancement of the Transputer would be 
the inclusion of a floating-point capability. 

Signal and image processing applications
generally call for
algorithms that are deterministic in 
both time and space. This allows develop¬ 
ment of a unified theoretical framework 
for architecture and algorithm design. The 
wavefront array is the outcome of the use 
of such a framework. It eliminates the 
need for global control characteristic of 
the systolic array and thus permits a dis¬ 
tributed and data-driven approach to the 
manipulation of the complex data depen¬ 
dencies in array processing. Moreover, it 
can easily cope with various programming 
requirements, with the need for fault toler¬ 
ance, and with computing networks hav¬ 
ing either global or irregular 
interconnections. The power and flexibil¬ 
ity of the wavefront array are demon¬ 
strated by its very broad application 
domain, including adaptive array process¬ 
ing, image and vision processing, seismic 
analysis, and medical signal processing. 
The trend toward the use of algorithmi¬ 
cally specialized parallel computers— 
including both systolic and wavefront 
arrays—will continue. Such machines will 
play a large role in future supercomputing 
technology. □ 
The Matrix-1 employs
a programmable and 
reconfigurable systolic 
array that achieves 
nearly gigaflop 
performance for 
problems in signal 
processing and matrix 
computation. 
There has been no reduction in the
demand for higher performance 
in scientific computation. This is 
especially true in digital signal processing 
because new algorithms for that applica¬ 
tion require much more computation than 
do conventional techniques. The increas¬ 
ing volumes of data from seismic surveys, 
sonar and radar systems, and imaging sys¬ 
tems need to be processed; these applica¬ 
tions are important sources of demand for 
more throughput. The analysis of data 
that are generated by such new imaging 
technologies as magnetic resonance imag¬ 
ing creates additional demand. 

In signal processing and scientific com¬ 
puting, parallelism is pervasive. While the 
granularity of the parallelism may vary, 
the opportunity to perform many calcula¬ 
tions in parallel is characteristic of it. Har¬ 
nessing this inherent, abundant parallelism 
through new computer architectures is 
now one of the major thrusts in computer 
technology. There are many parallel 
architectures that attempt to exploit it. 1-3

The wide variety of parallel-computer 
designs reflects the broad range of oppor¬ 
tunities for exploiting parallelism. In many 
cases, the architecture has been specialized 
for a particular type of parallel operation. 
This is often apparent in the arrangement 
and control of, and the interconnections 
between, processing elements and
memory. 

For the most part, the architectures pro¬ 
posed and built to exploit parallelism have 
been multiple-instruction, multiple-data 
(MIMD) architectures. In theory, MIMD 
architectures allow any parallel computa¬ 
tion to be performed. Several MIMD 
machines (for instance, the Cray X-MP and
the Sequent Balance 21000) consist of a 
few independent computational units that 
share memory. Others feature a loose 
coordination of multiple processors under 
the control of a single central processor (as 
in the case of the Intel iPSC). 

Other machines, such as the Connection 
Machine, the IBM GF11, and the 
Matrix-1, operate in a tightly coupled, 
single-instruction, multiple-data (SIMD) 
fashion. Within this group there is tremen¬ 
dous architectural diversity. Massively 
parallel SIMD architectures such as the 
Connection Machine employ hundreds or 
thousands of simple processors, while 
moderately parallel machines such as the 
Matrix-1 use fewer processing elements 
(the Matrix-1 uses up to 32 fast processing 
elements). The GF11 is intended for a spe¬ 
cific application; the other two machines 
are general-purpose computers. The Con¬ 
nection Machine features a distributed 
memory and a near-hypercube interconnection
network; the GF11 has a Memphis
switch connecting its processors to its
memory modules; the Matrix-1 has a
shared memory connected by a high-bandwidth
bus to the ring-connected, linear
processor array.

Figure 2. The Matrix Processor unit subsystem.

As originally conceived, a systolic array 
was a special-purpose planar array of sim¬ 
ple processing elements that featured a 
regular, nearest-neighbor interconnection 
network. In essence, it was a pipelined 
hardware device for some high-level 
mathematical function such as matrix 
multiplication. Such special-purpose sys¬ 
tolic arrays have been proposed for a vari¬ 
ety of applications, including several 
important matrix factorizations, digital 
filtering, and dynamic programming. 4-8
These designs are very attractive from the 
standpoints of simplicity and efficiency, 
and may ultimately be realized as 
extremely cost-effective VLSI devices. But 
direct VLSI implementation of systolic 
systems is not practical today because of 
the limits to integrated-circuit complexity 
and the high cost of chip design. Thus, 
board-level hardware systems must be 
built to implement systolic arrays. For this 
reason, special-purpose arrays have a seri¬ 
ous economic drawback that provides 
strong motivation for developing general- 
purpose systolic arrays: With special- 
purpose arrays, new hardware must be 
developed for each new or modified appli¬ 
cation. The specialized arrays are obvi¬ 
ously not satisfactory as general-purpose 
scientific computers. 

Warp is a general-purpose systolic system. 3
Experience with Warp suggests that
a programmable systolic array has applications
to a wide variety of scientific computations.
For example, in addition to the
initial use of Warp for computer vision, 
the machine is being successfully employed 
in signal-processing computations and in 
solving partial differential equations. A 
multiple-purpose systolic array such as 
Warp or the Matrix-1 overcomes the draw¬ 
backs of special-purpose hardware only to 
the extent that efficient software for new 
applications can be rapidly developed. The 
development of W2, a high-level language 
for Warp, has simplified this software 
development and has been a critical ele¬ 
ment in its success. 

In this article, we present the architec¬ 
ture of the Matrix-1 and review the issues 
that Saxpy Computer Corp. faced in 
designing it. The key architectural features 
that we consider are the Matrix-1’s use of
an SIMD control hierarchy; the use of a 
large global memory; the use of both sys¬ 
tolic and global data paths in the systolic 
array; the choice of a small number of fast 
processing elements; the use of Fortran as 
the programming language; the emphasis 
on block algorithms; and the provision in 
the hardware of double-buffered, 
software-managed local memory for the 
systolic array to support block algorithms. 

Architecture 

The five principal components of the 
Matrix-1 are illustrated in Figure 1; they 
are 

• the System Controller, a general-purpose
computer that executes the
application program and allocates 
Matrix-1 resources; 

• the Matrix Processor, a linear array 
of up to 32 pipelined, floating-point 
processors that have systolic and 
global interconnections; 

• the System Memory, which stores all 
data arrays for use by the Matrix 
Processor; 

• the Mass Storage System, an I/O 
interface that provides access to high¬ 
speed data-storage peripherals; 

• the Saxpy Interconnect, a combined 
control and data bus that links the 
other four units of the Matrix-1. 

System Controller. The System Controller is a DEC VAX that runs VMS. The
System Controller can compile and link the application program. Its main
function is to execute the application program, sending control information
across the Saxpy Interconnect to control the remainder of the hardware. Data
are stored in the System Memory, and the Matrix Processor performs
practically all floating-point computation; the System Controller’s job is
to coordinate resources.

Matrix Processor. The Matrix Processor subsystem (Figure 2) has three
principal components: the Matrix Processor (the programmable array), the
Matrix Processor Interface (the data pathway between the processor array and
the Saxpy Interconnect), and the Matrix Control Processor (a fast integer
processor that decodes commands received over the Saxpy Interconnect from the
System Controller and then controls the interface and processor array in
order to perform the desired computation).

Figure 3. The Matrix Processor zone architecture.

The Matrix Processor (Figure 3) is an array of 8, 16, 24, or 32 vector
processors; these are called computational zones. It is an SIMD
architecture: All zones receive the same control and address instructions at
a given clock cycle. The Matrix Processor can function in systolic mode (in
which data are transferred linearly across the zones) or in block mode (in
which all zones operate independently and use local data). Both operational
modes can be used within a single subroutine.

Each zone has a pipelined, 32-bit floating-point multiplier; a pipelined,
32-bit floating-point adder with logic capabilities (its ALU); and a 4K-word
local memory. These components operate on a 64-ns clock. The multiplier and
adder are cascaded. Lookup tables are provided for use in reciprocal and
reciprocal square-root calculations. Since each of the 32 zones may begin a
multiplication and an ALU operation with every clock cycle, the peak
computing rate of the Matrix Processor is 64 floating-point operations per
64-ns clock, or an average of one per nanosecond, for a total of 1 billion
floating-point operations per second (1000 MFLOPS). The instruction set of
the Matrix Processor zone includes a large variety of vector operations.
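The peak-rate arithmetic can be checked with a few lines of Python. This is only a sketch of the calculation described above; the function and constant names are illustrative and do not come from any Saxpy software.

```python
# Figures quoted in the text for a fully configured Matrix Processor.
ZONES = 32          # computational zones
OPS_PER_ZONE = 2    # one multiply and one ALU operation started per clock
CLOCK_NS = 64       # zone clock period in nanoseconds


def peak_mflops(zones=ZONES, ops_per_zone=OPS_PER_ZONE, clock_ns=CLOCK_NS):
    """Peak rate in millions of floating-point operations per second."""
    ops_per_clock = zones * ops_per_zone
    ops_per_second = ops_per_clock * 1e9 / clock_ns
    return ops_per_second / 1e6


# The four array sizes mentioned in the text.
for zones in (8, 16, 24, 32):
    print(zones, "zones:", peak_mflops(zones), "MFLOPS")
```

Under these figures the peak rate scales linearly with the number of zones installed.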

Although the Matrix Processor is an SIMD architecture and thus all the zones
nominally execute the same operations in parallel, two of its mechanisms
allow the programmer more freedom. First, any subset of the zones may be
disabled by setting the bits of a program-controlled mask. This permits
single-vector computation and calculations on a subset of the zones. Second,
the zone memories allow indirect addressing, in which elements of one vector
are used as pointers into another vector. “Scatter” and “gather” vector
operations are implemented with this indirect-addressing hardware and operate
at half speed (one word every two cycles).
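The two mechanisms can be pictured with a small software model. The sketch below is purely illustrative: the function names are hypothetical, and plain Python lists stand in for zone registers and zone memory.

```python
def masked_add(zone_values, operand, mask):
    """Apply `value + operand` only in zones whose mask bit is set.

    Disabled zones keep their previous value, which is how an SIMD array can
    work on a subset of zones while every zone sees the same instruction.
    """
    return [v + operand if enabled else v
            for v, enabled in zip(zone_values, mask)]


def gather(source, indices):
    """Indirect read: result[i] = source[indices[i]]."""
    return [source[i] for i in indices]


def scatter(values, indices, length, fill=0.0):
    """Indirect write: result[indices[i]] = values[i]."""
    result = [fill] * length
    for v, i in zip(values, indices):
        result[i] = v
    return result


print(masked_add([1, 2, 3, 4], 10, [1, 0, 0, 1]))
print(gather([10, 20, 30, 40], [3, 0, 2]))
print(scatter([40, 10, 30], [3, 0, 2], 4))
```

In this model a gather followed by a scatter with the same index vector restores each selected element to its original position.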

Within the Matrix Processor there are three types of data interconnections,
all of which are one word wide and operate at the clock speed of the Matrix
Processor. First, the systolic data path connects each zone to its right
neighbor; the last zone can connect to the systolic output buffer (which is
the global buffer) or, in a circular fashion, to the first zone. The global
buffer can provide data to the first zone over the systolic path, thereby
acting as a systolic input buffer. Second, the global buffer is the source
for a broadcast data path connected to all the zones. While broadcasting is
“systolic heresy” (since it is hard to do in VLSI), it is both useful and
feasible in the large-scale system we are discussing. Third, the local memory
of each zone has several connections, both to the zone’s processing elements
and to the Matrix Processor Interface. The section entitled “The memory
hierarchy” (below) contains more information on the design and use of the
Matrix Processor interconnections.
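To make the circular systolic path concrete, here is a toy simulation of a matrix-vector product on a ring of zones. It is an assumption-laden sketch, not Matrix-1 code: one zone per vector element, one row of the matrix stored in each zone, and a hypothetical function name.

```python
def ring_matvec(a, x):
    """Compute y = A x on a simulated ring of len(x) zones.

    Zone i stores row i of A and starts out holding x[i]. On each step every
    zone multiplies the x value it currently holds by the matching entry of
    its row, accumulates, and then passes the x value to its right neighbor
    (the last zone wraps around to the first).
    """
    n = len(x)
    held = list(x)      # held[i]: x value currently sitting in zone i
    y = [0.0] * n
    for step in range(n):
        for zone in range(n):
            # After `step` right shifts, zone i holds x[(i - step) mod n].
            col = (zone - step) % n
            y[zone] += a[zone][col] * held[zone]
        held = [held[-1]] + held[:-1]   # shift right, circularly
    return y


a = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]
print(ring_matvec(a, [1.0, 0.0, -1.0]))
```

Each x value is loaded once and then reused by every zone as it travels around the ring, which is the kind of data reuse the systolic path is meant to exploit.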

The Matrix Processor Interface (Figure 4) mediates between the system data
bus and the internal buses of the Matrix Processor. Four buffers are used to
hold blocks of data that are in transit between System Memory and the Matrix
Processor. The buffers are two-sided and can concurrently carry on transfers
with System Memory and transfers with the Matrix Processor zone memories.
Furthermore, the buffers support bidirectional transfers by allowing a read
and a write operation in a single clock cycle. The local memories of the
zones are also bidirectional and two-sided, and can support data transfers
involving the Matrix Processor Interface concurrently with computational
reads and writes. Thus, the Matrix Processor Interface can mediate
bidirectional transfers between System Memory and the Matrix Processor in
parallel with full-speed computation. The Matrix Processor Interface buffers
are connected through a partial crossbar switch to the internal buses of the
Matrix Processor, which permits several patterns of communication between the
Matrix Processor and its interface. In transferring data, the interface
buffers allow random access, which, among other things, permits the
programmer to transpose arrays “on the fly” as they are transmitted between
System Memory and the Matrix Processor. Such a facility allows full-speed
access to certain arrangements of noncontiguous System Memory locations.

Figure 4. The Matrix Processor Interface.

The Matrix Control Processor executes computational subroutines to control
the flow of data between System Memory and the Matrix Processor and to issue
the computational instructions to the zones. For example, to multiply two
matrices, the Matrix Control Processor decomposes large data arrays into
arrays of 32 x 32 blocks, issues commands to the Matrix Processor Interface
that instruct it to move blocks as necessary from System Memory to the Matrix
Processor and back, and issues vector-command sequences to the Matrix
Processor that instruct it to perform operations on the 32 x 32 blocks.
Data-transfer and computational commands are interleaved so as to overlap
operations for maximum efficiency; the resulting concurrent execution allows
the Matrix Processor to compute at full speed without waiting for data.
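The decomposition into blocks can be sketched in ordinary Python. This is a generic blocked matrix multiply, not the Matrix Control Processor’s actual command sequence; the function name is hypothetical, and the block size is a parameter so that the example runs quickly on small inputs.

```python
def blocked_matmul(a, b, block=32):
    """Multiply square matrices (lists of lists) block by block.

    Assumes the matrix order n is a multiple of `block`. The three outer
    loops walk over blocks; the three inner loops are the work that would be
    done on one set of blocks held in fast local memory.
    """
    n = len(a)
    assert n % block == 0, "matrix order must be a multiple of the block size"
    c = [[0.0] * n for _ in range(n)]
    for bi in range(0, n, block):            # block row of C
        for bj in range(0, n, block):        # block column of C
            for bk in range(0, n, block):    # blocks of A and B to "load"
                for i in range(bi, bi + block):
                    row_a, row_c = a[i], c[i]
                    for k in range(bk, bk + block):
                        aik = row_a[k]
                        row_b = b[k]
                        for j in range(bj, bj + block):
                            row_c[j] += aik * row_b[j]
    return c


a = [[float(i + j) for j in range(4)] for i in range(4)]
b = [[float(i * j + 1) for j in range(4)] for i in range(4)]
print(blocked_matmul(a, b, block=2))
```

In a real implementation the loads of the next blocks would be issued while the inner loops run on the current ones; the sketch shows only the order of the arithmetic.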

In programming the Matrix Control Processor, one views the individual zones
as vector computers. The machine instruction set is based on a set of vector
operations replicated across all zones. The pipelining and the various
registers and multiplexers in the Matrix Processor zone are not seen by the
programmer, nor are they directly controlled by a Matrix Control Processor
program. Control of these elements is achieved by instruction-decode hardware
and microcode in the Matrix Processor.

The application programmer at the System Controller level need not be aware
of the implementation details of the computational subroutines. The internal
details of the Matrix Processor and the Matrix Control Processor are hidden
from the application level; each subroutine implements a mathematical
operation (for example, Fast Fourier Transform or matrix multiply) on an
entire System Memory array. The section entitled “The programming model”
(below) describes the levels at which one may program the Matrix-1.

System Memory. The floating-point data processed by the Matrix-1 reside in
the System Memory, which ranges in size from 16M to 128M words. The
Matrix-1’s ability to store very large data sets in main memory can reduce
significantly the amount of intermediate data transfer to auxiliary storage
devices. Moreover, the Matrix-1 allows only a single job to use the System
Memory and Matrix Processor at a given time, so that every job is guaranteed
to have all of the available memory. Performance is therefore predictable and
is not reduced by the cost of swapping or the other overheads incurred in
sharing memory. All application and system programs reside in the System
Controller’s memory and therefore do not consume space in the System Memory.

The System Memory is not virtually addressable. Therefore, no address
translation is needed. Programs exceeding the 128M-word limit must implement
their own off-line memory management. (General experience in large-scale
scientific computing indicates that hardware paging strategies deserve
improvement and that users obtain high performance by managing large data
sets themselves.)

Memory cycle time is 100 ns. A wide-word consisting of eight adjacent 32-bit
words is read or written in each cycle. Successive accesses to consecutive
wide-word locations can occur at the rate of one per cycle. Thus, the memory
bandwidth is 80M words per second when one is accessing large contiguous
blocks of data. Reduced bandwidth performance is characteristic of several
modes of nonsequential access, including the random wide-word access and
single-word access modes. The memory includes single-error correction and
double-error detection on pairs of 32-bit words.

Mass Storage System. The Mass Storage 
System (Figure 5) supports two types of 
I/O activity. First, it makes possible direct- 
memory-access transfers between disk files 
and System Memory. Disk files may be 
striped across several drives to increase 
transfer rates up to the limit of 3M words 
per second. The second type of I/O 
activity implements staged transfers 
between external memory and peripherals 
without using the Saxpy Interconnect. 
Each type of data transfer may be carried 
out in parallel with computations and their 
attendant data movement. 


Design issues for 
parallel processing 

In this section we shall give the reasons 
for some of the decisions Saxpy Computer 
Corp. made in designing the architecture 
of the Matrix-1. 

Figure 5. The Mass Storage System.

Issues in designing the Matrix Processor. Many of the major issues faced in
designing the Matrix-1 involved the structure and function of the Matrix
Processor. The objectives of the Matrix-1 design were speed, broad
applicability, simplicity, and ease of programming.

The Matrix Processor is an SIMD array, which lends it important architectural
advantages. The control unit and program memory need not be replicated. The
programming model is simple. An SIMD design may be synchronous, and the
Matrix Processor array is. In a synchronous multiprocessor, adjacent
processing elements can communicate data without the overhead that normally
results from using operating system software and an asynchronous hardware
interface. In the Matrix Processor, the latency of communication with a
neighboring processor is only one clock cycle of 64 ns. This is more than
three orders of magnitude lower than the latency achieved by typical
asynchronous MIMD multiprocessors such as the Intel iPSC. Finally, in signal
processing and many other areas of scientific computing, a wealth of
low-level data parallelism makes possible the full exploitation of the SIMD
array.


The arithmetic elements of the Matrix Processor zone are shown in Figure 3.
The floating-point multiplier and the ALU handle 32-bit floating-point data.
The output of the multiplier is cascaded into one of the ALU inputs in order
to reduce the peak demand on the zone buffer from six to four operands and
results per cycle. In most matrix computation, signal processing, and other
areas of scientific computing, sums of products are quite common; a sum of
products may be computed at full efficiency if one uses this cascaded
processor arrangement.

Use of the shared global buffer through the global path has several benefits.
Since global data are not replicated in each zone, storage in the zone buffer
is free to be used for independent local operands. Furthermore, computations
using one operand from the global path further reduce the peak demand on the
zone buffer.

Even with cascaded processing elements and use of a global operand, three
accesses to the zone buffer (two reads and a write) are needed every clock
cycle. To provide this data bandwidth, the zone buffer is two-way
interleaved. Also, it is implemented in a faster technology (ECL SRAM) than
the remainder of the zone; the technology allows the memory to operate on a
32-ns clock cycle, which is double the zone’s clock rate.

The memory hierarchy. To support computation at 1000 MFLOPS in the Matrix
Processor, data must be accessed at a rate of 2000M words/sec. (There are
three reads, one write, and two operations in every zone during every clock
cycle.) Were these data to be provided by the System Memory, the memory’s
speed would have to be increased by the ratio 2000:80, or 25:1. To accomplish
this, a far more expensive technology would be needed; in fact, no memory
this large and fast has ever been built. Thus, a memory hierarchy is needed.
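The bandwidth gap can be reproduced from the figures quoted in the text (an 8-word wide-word per 100-ns memory cycle; four operand accesses per zone per 64-ns clock; 32 zones). The helper name below is illustrative only.

```python
def words_per_second(words_per_cycle, cycle_ns):
    """Convert a per-cycle word count and a cycle time in ns to words/sec."""
    return words_per_cycle * 1e9 / cycle_ns


# System Memory: one 8-word wide-word per 100-ns cycle.
system_memory_bw = words_per_second(8, 100)

# Matrix Processor: 3 reads + 1 write per zone per 64-ns clock, 32 zones.
processor_demand = words_per_second(32 * (3 + 1), 64)

print(system_memory_bw / 1e6)                 # M words/sec supplied
print(processor_demand / 1e6)                 # M words/sec demanded
print(processor_demand / system_memory_bw)    # shortfall factor
```

The last figure is the 25:1 ratio that the block store and the block algorithms have to make up.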

Most vector computers employ vector 
registers to buffer data between main 
memory and the computational unit; 
many machines employ cache memories 
for the same purpose. The Matrix-1 takes 
a very different approach in its memory 
hierarchy. 

The Matrix-1 employs an algorithmic solution that exploits the structure of
scientific computations, especially the high degree of data reuse inherent in
matrix and signal processing. For instance, multiplication of two
$N \times N$ matrices takes $2N^3$ floating-point operations on $3N^2$ data
accesses, which is roughly an $N$-fold reuse of data. Furthermore, one can
perform matrix multiplication by using a block algorithm, which operates on
small blocks of data rather than on scalars. By choosing the block size, the
programmer (or the compiler) can keep the data being used in the fast end of
the memory hierarchy and thus obtain the desired ratio of computation to main
memory accesses. Systolic algorithms also achieve reuse of data and thus
require less System Memory bandwidth. The use of systolic and block
algorithms effectively reduces by the necessary factor of 25 or more the
memory and data-bus bandwidths needed to support the Matrix Processor.
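How the block size sets the ratio of computation to main-memory traffic can be estimated with a simple model. The model is an assumption made here for illustration, not a statement from the article: a block of the result stays in local memory while matching blocks of the two operands stream in, and the write-back of the result block is ignored.

```python
def main_memory_words_per_flop(b):
    """Main-memory traffic of blocked matrix multiply, in words per flop.

    Assumption: a b-by-b block of C stays in local memory while matching
    blocks of A and B stream in, so each block product costs 2*b*b loaded
    words and 2*b**3 floating-point operations.
    """
    return (2 * b * b) / (2 * b ** 3)


def required_bandwidth_mwords(mflops, b):
    """Main-memory bandwidth (M words/sec) needed to sustain `mflops`."""
    return mflops * main_memory_words_per_flop(b)


for b in (1, 8, 32):
    print(b, required_bandwidth_mwords(1000, b))
```

Under this model, 32 x 32 blocks at 1000 MFLOPS need about 31M words/sec from main memory, which is below the 80M words/sec that the System Memory supplies for contiguous access.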

To exploit block and systolic algorithms, the Matrix Processor needs a block
store, a small but fast local memory. The program loads into a block store
the data blocks being used in computation, while all other data are kept in
the System Memory. In the Matrix-1, this block store is the zone buffer
memory, which holds 128K words and has a processing-side bandwidth of 1500M
words/sec. (H.T. Kung has given a general analysis of the local memory
required for the processing elements of a systolic array.⁹)

The zone buffer memory of the Matrix-1 is partitioned into 32 banks, each of
which is assigned to one of the zones. An alternative design is employed in
the Alliant FX/8, in which a cache is shared by the eight processing
elements; these elements are connected to the cache through a crossbar.
Scaling the Alliant design to a large number of processors would, however,
lead to prohibitive growth in the size and latency of the cache-to-processor
connection. The Matrix Processor employs a partial crossbar to transfer data
between the Saxpy Interconnect and the zone memories. The Matrix Processor
contains a global buffer and global broadcast data path for common data that
are shared by all the zones.

Cache memory is not employed in the Matrix-1. In cache memory systems, the
local memory is under control of the hardware, which fetches data from main
memory after cache misses, or requests for data not residing in the cache.
With 32 zones and 32 caches, the load on memory and bus bandwidth would be
too great for cache memory to be useful in the Matrix-1. Therefore, the zone
memory is explicitly managed by the Matrix Control Processor program. This
approach allows the programmer or compiler to make near-optimum use of the
zone memory.

The Matrix Processor Interface functions primarily as an I/O buffer between
System Memory and the zone memories. The partial-crossbar connection between
the interface’s four buffers and the zone buffers makes possible a variety
of data-access patterns. However, the Matrix Processor Interface can also be
viewed as a flexible interzone data connection. For example, the interface
facilitates array transposition, which can be difficult to program on linear
systolic arrays. In addition, the systolic data path provides for fast data
transfer between adjacent zones.

Issues in designing the System Controller. The Matrix-1 employs a VAX
computer as its System Controller, in much the same way that the Warp uses a
Sun workstation.

The primary consideration in designing the System Controller was how to allow
a relatively slow control processor to direct the computation of the
extremely fast (1000 MFLOPS) Matrix Processor. This goal was achieved by a
combination of three methods: asynchronous execution of the System Controller
and the Matrix Processor, hierarchical control, and storage of data in System
Memory.

The System Controller of the Matrix-1 controls the computational and
data-transfer subsystems. The application program in the System Controller
issues control packets that are buffered in queues maintained by the System
Management Interface; this command buffering allows the System Controller to
proceed independently of I/O and computation in the Matrix Processor. Because
the System Controller need not pace the rest of the hardware exactly, both
units can be used efficiently.

The System Controller sends commands to the System Management Interface,
which relays the commands to the Matrix Processor or the Mass Storage System
on an as-needed basis. Computational commands are routed to the Matrix
Control Processor, which then executes the appropriate subroutine. The
subroutine in turn issues commands to the Matrix Processor zone array; in
effect, these commands decode the System Controller’s command. Data-transfer
commands from the System Controller are similarly translated into a sequence
of lower-level instructions. Thus, the System Controller’s interaction with
the rest of the machine occurs at a high level; it is not burdened with any
low-level control function.

While many attached processors compute using data residing in the host memory
and thus require time-consuming data transfers, the Matrix-1 assigns nearly
all data storage to the System Memory. Little or no data reside in the System
Controller, so data transfer to and from the System Controller rarely causes
a bottleneck.

The programming 
model 

The application program is written in a high-level language (such as Fortran
or C) and runs on the System Controller. It is developed with standard VMS
programming tools. In order to direct the Matrix Processor to operate on the
large data arrays located in System Memory, the application program makes
calls to Matrix Processor subroutines. Each such subroutine performs a
substantial computation on an entire array of data in System Memory (for
example, a 1024 x 1024 two-dimensional complex FFT). The granularity of the
tasks performed by the Matrix Processor is quite large (20 million complex
multiply-adds would be necessary for the computation in the example just
given), which is why the VAX is able to keep up with the Matrix Processor.
The arguments to library subroutines include pointers to arrays of data in
the System Memory. Dynamic allocation and manipulation of System Memory data
arrays are performed by a software module that resides in the System
Controller.
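The size of the FFT task in this example can be estimated as follows. The counting convention is an assumption made here: one complex operation per data point per radix-2 stage. The article does not say how its figure of 20 million was derived, so this is only one plausible reading.

```python
import math


def fft2d_point_stage_count(rows, cols):
    """Data points times radix-2 stages for a rows x cols 2-D FFT.

    Assumes both dimensions are powers of two. A 2-D transform applies
    log2(cols) stages along the rows and log2(rows) stages along the columns.
    """
    points = rows * cols
    stages = int(math.log2(rows)) + int(math.log2(cols))
    return points * stages


print(fft2d_point_stage_count(1024, 1024))
```

Under this convention a 1024 x 1024 transform comes to roughly 21 million complex operations, the same order as the figure quoted in the text.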

A library of highly optimized assembly-language Matrix Processor subroutines
is also provided. These routines perform many basic functions, such as vector
operations, matrix operations, Fast Fourier Transforms, and copying data from
one System Memory array to another. Many application programs can be written
for the System Controller as Fortran programs with calls to these library
subroutines. This programming method offers substantial advantages. The
details of parallel implementation are handled at the lowest level, which
results in increased performance, while the applications programmer uses
high-level operators and is not burdened with the low-level algorithmic and
hardware details.

Figure 6. Computation of the Cholesky factorization ($A = LL^T$) of a
symmetric positive definite matrix $A$. $A = \|a_{ij}\|_{i,j=1}^{n}$,
$L = \|l_{ij}\|_{i,j=1}^{n}$, and $L$ is lower triangular.

Figure 7. The Cholesky algorithm that appears in Figure 6 is here recast into
a block algorithm.

While the library routines are sufficient for many applications, one can
often improve efficiency by writing special-purpose Matrix Processor
subroutines. These subroutines are written in a subset of Fortran. An
optimizing compiler (under development), similar to the vectorizing compilers
used for vector machines, recognizes inner loops and nested inner loops that
can be executed as parallel vector operations on an SIMD array. The compiler
then generates a Matrix Control Processor program that moves data blocks from
the System Memory to the zone buffers and global buffer, executes parallel
vector operations on those data, and stores result blocks back into the
System Memory. The compiled program overlaps block transfers and computation
for increased processor efficiency.

There is a second useful programming level for the Matrix Processor. An
assembly language for the Matrix Processor subsystem, the Primitive
Definition Language (PDL), is used at this level. In the PDL, memory
addresses are used explicitly. The programmer is responsible for loading and
storing data blocks from the System Memory. He or she uses vector operations
rather than relying on a compiler to vectorize loops. At this level, the
programmer may produce his or her own optimized Matrix Processor subroutine
library.

The chief drawback of this programming model is that existing code cannot be
compiled. The programmer of the Matrix-1 is required to partition the
computation into the part that executes in the System Controller and the
calls to a predefined library of subroutines in the Matrix Processor. An
intelligent compiler that handles this partitioning for the programmer has
not yet been developed.

Algorithms and 
applications 

The Matrix Processor can efficiently handle a great number of applications
through its systolic and block modes. The systolic mode allows it to employ
all of the large number of systolic algorithms that have been developed. For
example, R.P. Brent, F.T. Luk, and C. Van Loan⁴,⁵ have developed several
very elegant systolic algorithms for the matrix eigenvalue and singular value
decompositions, which are vitally important in signal processing and many
other areas of scientific and statistical computing. Implementation of these
methods as Matrix Processor library subroutines was a relatively
straightforward process. For other algorithms, the block mode was quite
appropriate. The Fast Fourier Transform of multiple data vectors was
efficiently implemented in block mode.

Block algorithms have been important in the development of subroutines for
matrix computations. A block algorithm operates on blocks (blocks are square
submatrices) in much the same way that a scalar algorithm operates on
individual numbers. For example, instead of working with a matrix as an array
of individual numbers, a block algorithm treats the matrix as a smaller array
of blocks of numbers. A scalar multiplication of two numbers might be
replaced by a block-matrix multiplication. To illustrate the concept, we
shall now consider a more detailed example of a block algorithm.

The Cholesky factorization ($A = LL^T$) of a symmetric positive definite
matrix $A$, where $A = \|a_{ij}\|_{i,j=1}^{n}$,
$L = \|l_{ij}\|_{i,j=1}^{n}$, and $L$ is lower triangular, may be computed as
shown in Figure 6; in this case, $l_{jk}^T$ is the transpose of the $(j, k)$
element of $L$.
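As a rough illustration (not a transcription of Figure 6, whose loop ordering may differ), a standard column-by-column Cholesky factorization can be written in a few lines of Python. The function name and the list-of-lists representation are choices made for this sketch.

```python
import math


def cholesky(a):
    """Return lower-triangular L with A = L L^T.

    `a` is a symmetric positive definite matrix given as a list of lists;
    only its lower triangle is read.
    """
    n = len(a)
    l = [[0.0] * n for _ in range(n)]
    for k in range(n):
        # Diagonal entry: remove contributions of earlier columns, then sqrt.
        d = a[k][k] - sum(l[k][j] ** 2 for j in range(k))
        if d <= 0.0:
            raise ValueError("matrix is not positive definite")
        l[k][k] = math.sqrt(d)
        # Entries below the diagonal in column k.
        for i in range(k + 1, n):
            s = a[i][k] - sum(l[i][j] * l[k][j] for j in range(k))
            l[i][k] = s / l[k][k]
    return l


a = [[4.0, 12.0, -16.0], [12.0, 37.0, -43.0], [-16.0, -43.0, 98.0]]
print(cholesky(a))
```

The only operations on the diagonal are a subtraction and a square root; these are the steps that become small matrix computations in a block formulation.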

This Cholesky algorithm may be recast as a block algorithm without
difficulty. Let matrices $A$ and $L$ have a block matrix structure
$A = \|A_{ij}\|_{i,j=1}^{N}$, $L = \|L_{ij}\|_{i,j=1}^{N}$, with each block
$A_{ij}$ and $L_{ij}$ a $b \times b$ square submatrix. We now interpret
$(A_{kk})^{1/2}$ to be the lower triangular factor of the Cholesky
decomposition of $A_{kk}$ and $(L_{kk})^{-1}$ to be the matrix inverse of
$L_{kk}$. The block Cholesky algorithm is shown in Figure 7. For symmetric
positive definite matrices $A$, this algorithm is well defined
($L_{kk}^{-1}$ exists) and stable. The similarities of scalar and block
examples extend to many block algorithms; the most common difference occurs
when one replaces a scalar operation by a small block calculation (in
Figures 6 and 7, for example, reformulation of a square root as a Cholesky
factor).
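A block version can be sketched along the same lines. This is one common (right-looking) arrangement and is not claimed to match Figure 7. Two choices are assumptions of the sketch: the matrix order is a multiple of the block size, and the explicit inverse $(L_{kk})^{-1}$ is replaced by an equivalent triangular solve.

```python
import math


def cholesky(a):
    """Unblocked Cholesky of a small SPD block (reads the lower triangle)."""
    n = len(a)
    l = [[0.0] * n for _ in range(n)]
    for k in range(n):
        l[k][k] = math.sqrt(a[k][k] - sum(l[k][j] ** 2 for j in range(k)))
        for i in range(k + 1, n):
            s = a[i][k] - sum(l[i][j] * l[k][j] for j in range(k))
            l[i][k] = s / l[k][k]
    return l


def block_cholesky(a, b):
    """Cholesky factor of `a`, computed on b-by-b blocks."""
    n = len(a)
    assert n % b == 0, "matrix order must be a multiple of the block size"
    w = [row[:] for row in a]       # working copy; lower triangle becomes L
    for k in range(0, n, b):
        # 1. Factor the diagonal block: L_kk = (A_kk)^(1/2).
        lkk = cholesky([row[k:k + b] for row in w[k:k + b]])
        for i in range(b):
            w[k + i][k:k + b] = lkk[i]
        # 2. Block column below the diagonal: solve L_ik L_kk^T = A_ik.
        for r in range(k + b, n):
            for c in range(b):
                s = w[r][k + c] - sum(w[r][k + j] * lkk[c][j] for j in range(c))
                w[r][k + c] = s / lkk[c][c]
        # 3. Update the remaining lower triangle: A_ij -= L_ik L_jk^T.
        for r in range(k + b, n):
            for c in range(k + b, r + 1):
                w[r][c] -= sum(w[r][k + j] * w[c][k + j] for j in range(b))
    # Entries above the diagonal are not part of L.
    return [[w[i][j] if j <= i else 0.0 for j in range(n)] for i in range(n)]


a = [[4.0, 2.0, 8.0, 4.0],
     [2.0, 10.0, 7.0, 17.0],
     [8.0, 7.0, 21.0, 15.0],
     [4.0, 17.0, 15.0, 31.0]]
print(block_cholesky(a, 2))
```

Steps 2 and 3 consist entirely of small matrix-matrix operations on blocks, which is where the reuse of data held in local memory comes from.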

The principal advantage of block algorithms is their local-data-access
pattern. This pattern implies that once a block of data is brought into a
processor’s local memory, the block may be reused repeatedly before being
written back to main memory. Block algorithms are therefore well suited to
machines with fast local memory because such algorithms require fewer
accesses to main memory for a given number of arithmetic operations.

In addition to matrix multiplication and Cholesky decomposition, block
algorithms have been developed for QR factorization¹⁰,¹¹ and the eigenvalue
and singular value decompositions.¹²,¹³ For these and other problems, block
algorithms produce a dramatic reduction in memory traffic. For example,
Figures 8 and 9 describe matrix multiplication of two 1024 x 1024 matrices by
vector and blocked methods, respectively. (We assume that the computing
processors have local memory sufficient to hold a single vector or block, as
the case may be.) Figure 8 indicates that all of matrix A must be transferred
to local memory to compute each column of the result matrix C (a total of 1.1
billion data transfers). Under the blocked algorithm, 32 x 32 blocks are
fetched into local memory. Figure 9 indicates that all of matrix A must be
transferred into local memory to compute each block column of the result
block matrix C. Since there are only 32 block columns, the data-transfer
requirements are reduced by a factor of 32.

Figure 8. I/O requirements of vectorized matrix multiplication. All of matrix
A must be transferred from memory to compute each column of result matrix C.
On a 1024 x 1024 matrix multiplication, this totals over 1 billion memory
accesses.

Figure 9. I/O requirements of blocked matrix multiplication. All of matrix A
must be transferred from memory to compute each block column of result matrix
C. With blocks of size 32 x 32, the matrix multiplication in this figure
requires only 34 million memory accesses. Blocked matrix multiplication
reduces I/O requirements by a factor of 32.
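The transfer counts quoted for the two methods follow directly from counting how many times matrix A is re-read. A short check, with illustrative function names:

```python
def vector_method_transfers(n):
    """All of A (n*n words) is re-read for each of the n columns of C."""
    return n * n * n


def blocked_method_transfers(n, b):
    """All of A is re-read once per block column of C (n // b of them)."""
    return n * n * (n // b)


n, b = 1024, 32
print(vector_method_transfers(n))        # about 1.1 billion
print(blocked_method_transfers(n, b))    # about 34 million
print(vector_method_transfers(n) // blocked_method_transfers(n, b))
```

The ratio equals the block size whenever the matrix order is a multiple of it, because a whole block column of C is produced from each pass over A.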

All block algorithms reduce the number of data transfers from main memory.
Figure 10 shows the performance advantages of several important block
algorithms. Depending on problem size and I/O requirements, algorithms
organized to use 32 x 32 blocks of data are superior by factors of 8 to 32 to
their scalar counterparts.

Efficiency of data transfers is also significant. If block algorithms are
used to structure computations, all of the necessary data movements can be
specified at compile time rather than at run time. Data can be transferred
from main to local memory in anticipation of requests, which eliminates cache
misses and the resulting processor waits. The control of these data transfers
can also be handled in parallel with the control of the processing units.
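The benefit of transferring data in anticipation of requests can be seen in a simple timing model. The model is a deliberate simplification assumed for this sketch: every block takes the same transfer time and the same compute time, the transfer of the next block fully overlaps computation on the current one, and write-back of results is ignored.

```python
def serial_time(n_blocks, transfer, compute):
    """Total time when each block is loaded and then processed in turn."""
    return n_blocks * (transfer + compute)


def double_buffered_time(n_blocks, transfer, compute):
    """Total time when the next block loads while the current one computes."""
    if n_blocks == 0:
        return 0
    # First load cannot be hidden; after that each step costs the slower of
    # the two activities; the last block still has to be computed.
    return transfer + (n_blocks - 1) * max(transfer, compute) + compute


print(serial_time(10, 1, 4))
print(double_buffered_time(10, 1, 4))
```

When computation per block takes longer than the transfer, as block algorithms are designed to ensure, nearly all of the transfer time is hidden.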


Figure 10. An illustration of the reduction in input/output requirements of
block processing compared with scalar processing. Four common mathematical
calculations are compared in terms of their input/output requirements on main
memory. Block processing of matrices composed of 32 x 32 blocks yields an I/O
reduction factor as high as 32. (Vertical axis: block processing I/O
reduction factor. Horizontal axis: matrix size N x N in blocks, where blocks
are 32 x 32 submatrices.)


Comparison of Matrix-1 
with other approaches 

Warp³ is the only other general-purpose systolic computer. In certain
respects, the two machines are quite similar: Each has a general-purpose
host, a global memory, and a linear array of processors with local memory.
Other systolic hardware⁸,¹⁴,¹⁵ has been developed, but not with the goals of
programmability and generality characteristic of the Matrix-1 and the Warp.

There are some important differences, however. The Warp array is an MIMD
computer. The Warp array allows two-way systolic dataflow. The Warp array has
neither a global data path nor global I/O paths. Peak performance of the
Matrix-1 is about 10 times greater than that of the Warp.³

The programmer’s model is also different. Although the Warp programming
language, W2, generates code for the whole Warp (including the host, the
interface unit, and the array), it is the programmer’s responsibility to
determine which computations are performed by which cells of the array and to
move the data among the cells and between the cells and the memory
explicitly.

The Matrix-1 programmer must also 
decide what is done in the host and what 
is done in the processor array. But the 
array program is written in Fortran. The 
Fortran compiler is responsible for more 
of the optimization than is the W2 
compiler. 

The Matrix-1 is the first commercial, general-purpose, systolic computer. It
consists of a powerful programmable systolic array (the Matrix Processor)
that has a rich set of data paths and a very fast local memory and is coupled
to a System Controller and a System Memory that together act as a highly
optimized host for the array. The Matrix Processor achieves 1-GFLOPS
throughput by means of 32-fold parallelism, fast (64 ns) pipelined
floating-point units, and fast and flexible local memories. Its global data
path allows for broadcast as well as nearest-neighbor connections. Because of
its rich connectivity it is adaptable to many SIMD algorithms. And with its
Fortran compiler, software development for new applications is fast and
inexpensive.

The design of the Matrix-1 explicitly employs reduced memory bandwidth in
support of large computational throughput; this approach was taken because
many systolic and block algorithms have low I/O requirements. The local
memories of the Matrix Processor are used as a block store for supporting
compute-intensive block and systolic algorithms.

The Matrix-1 is being applied to prob¬ 
lems in signal processing, seismic process¬ 
ing, and numerical linear algebra. 16 
Sustained computational rates in excess of 
900 MFLOPS have been achieved at the 
Saxpy Computer Corp. facilities. The first 
delivery of a Matrix-1 to a beta-test site 
occurred in April 1987. Full production is 
scheduled for late 1987. □ 
Recent signal-processing
algorithm 
developments have 
stressed direct 
methods that operate 
on the data matrix via 
orthogonal matrix 
decompositions. 

Systolic arrays appear 
to be quite well 
matched to the 
requirement for 
real-time computation 
of these algorithms. 
Systolic array computer architectures
provide a means for fast computation
of the linear algebra
algorithms that form the building blocks 
of many signal-processing algorithms, 
facilitating their real-time computation. 
For applications to signal processing, the 
systolic array operates on matrices, an 
inherently parallel view of the data, using 
numerical linear algebra algorithms that 
have been suitably parallelized to effi¬ 
ciently utilize the available hardware. This 
article describes work currently underway 
at the Naval Ocean Systems Center, San 
Diego, California, to build a two- 
dimensional systolic array, SLAPP, 
demonstrating efficient and modular 
parallelization of key matrix computations 
for real-time signal- and image-processing 
problems. 

Matrix-based signal 
processing 

Recent signal-processing algorithm 
developments tend to provide improved 
performance through the incorporation of 
a priori information and the utilization of 
data dependent basis vectors for signal 
representation. Our goals are to provide 
computationally efficient, numerically sta¬ 
ble parallelization of statistically effective 
signal-processing techniques. 

0018-9162/87/0700-0045$01.00©1987IEEE 


Such algorithms require a fairly broad 
suite of matrix computations. The key 
algorithms required 1 are matrix multipli¬ 
cation or matrix-vector multiplication, 
orthogonal reduction to triangular form, 
triangular backsolve, singular value 
decomposition (SVD), and generalized 
singular value decomposition (GSVD). 2,3 

Matrix-vector multiplication provides 
the implementation of linear transforma¬ 
tions, such as those needed for classical 
beam forming and image data com¬ 
pression. 

Orthogonal triangularization may be 
combined with a triangular backsolve to 
provide linear equation solution and least 
squares solutions. 4 Least squares solu¬ 
tions are needed for adaptive beamform¬ 
ing, interference cancellation, and model 
fitting. 5,6 

The SVD has many applications, 
including rank determination, reduced 
rank approximation/data compression, 
and regularized least squares solution. It 
may also be used to find the eigensystem 
of C = A^H A. The eigenvalues of C are
the squares of the singular values of A, and
the eigenvectors of C are the right singular
vectors of A. In this and many other
computations, which conceptually use C,
it is numerically preferable to compute
directly with A and not form A^H A.
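
The relationship between the SVD of A and the eigensystem of C can be written out in one line. The following is a short worked derivation, assuming the usual notation A = U Σ V^H with U and V having orthonormal columns and Σ diagonal (these symbols are not used in the text above):

```latex
\[
A = U \Sigma V^{H}
\quad\Longrightarrow\quad
C = A^{H} A = V \Sigma^{H} U^{H} U \Sigma V^{H} = V \Sigma^{2} V^{H},
\]
\[
\text{so that}\quad C\,v_i = \sigma_i^{2}\, v_i ,
\]
```

where v_i is the i-th right singular vector of A and σ_i the corresponding singular value. Because the eigenvalues of C are the squares σ_i^2, small singular values of A are pushed much closer to the rounding level when C is formed explicitly, which is one way to see why working directly with A is preferred.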

The MUSIC 7 direction finding method
of R. Schmidt requires solution of a Hermitian
generalized eigensystem of the form
A^H A x = λ B^H B x. As in the ordinary Hermitian
eigensystem problem, one would
compute directly with A and B.

Figure 1. (a) The Luk QRD systolic architecture, (b) The Luk triangular SVD systolic architecture, (c) Synthesis of the QRD
and SVD architectures into the unified architecture. The lines drawn between processors denote the dataflow.

Parallel matrix 
computation 

For real-time implementation, it is desirable
to provide O(N) parallelism for the
O(N^2) computations and O(N^2) parallelism
for the O(N^3) computations, so that
both may be performed in O(N) time. Systolic
arrays 8 are regular arrays of processors
with common control, synchronous
timing, and only nearest neighbor inter¬ 
connection. When a systolic design is 
available, it provides modularity and rela¬ 
tive simplicity of design, since throughput 
scales linearly with the number of cells 
used, and usually only one or two types of 
cell is needed. 

Noniterative matrix 
operations 

For general matrix multiplication,
orthogonal reduction to triangular form,
and triangular backsolves, systolic forms of
the optimal or nearly optimal serial
algorithms are available. The engagement
architecture 9 uses an N x N array of
processors to perform N x N matrix
multiplication in time 3N - 2 with no pipelining
of successive matrix multiplications, or
time N with such pipelining. It uses the H.
T. Kung inner product step processor, 8
and requires only multiplications and
additions in the cells.
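
The 3N - 2 figure can be reproduced with a small simulation. The sketch below is a generic systolic schedule in which each cell keeps one output element and operands arrive skewed in time; it is an illustration under that assumption and not a description of the engagement processor itself. The function name `systolic_matmul` is hypothetical.

```python
def systolic_matmul(a, b):
    """Simulate an N x N array of inner product step cells computing a * b.

    Assumed schedule: row i of `a` is delayed by i ticks and column j of `b`
    by j ticks, so a[i][k] and b[k][j] meet in cell (i, j) at tick i + j + k.
    Returns the product and the number of ticks used.
    """
    n = len(a)
    c = [[0.0] * n for _ in range(n)]
    done = 0          # multiply-adds performed so far
    tick = 0
    while done < n ** 3:
        for i in range(n):
            for j in range(n):
                k = tick - i - j
                if 0 <= k < n:
                    c[i][j] += a[i][k] * b[k][j]   # one inner product step
                    done += 1
        tick += 1
    return c, tick


product, ticks = systolic_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]])
print(product, ticks)
```

The last operand pair meets in the far corner cell at tick 3(N - 1), so the loop runs for 3N - 2 ticks when a single product is computed without pipelining a second one behind it.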

Triangular backsolves may be performed
efficiently by the linear systolic
array of H. T. Kung. 8,10 Interior cells perform
only multiplications and additions,
while one boundary cell is required to perform
divisions.
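
The split of work between interior and boundary cells mirrors ordinary serial back-substitution, sketched below. This is the serial algorithm only, with comments marking which operations would fall to which kind of cell; the function name `back_substitute` is hypothetical.

```python
def back_substitute(r, b):
    """Solve r x = b for an upper triangular r with nonzero diagonal."""
    n = len(b)
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        acc = b[i]
        for j in range(i + 1, n):
            acc -= r[i][j] * x[j]     # interior-cell work: multiply and add
        x[i] = acc / r[i][i]          # boundary-cell work: the only division
    return x


print(back_substitute([[2.0, 1.0], [0.0, 4.0]], [4.0, 8.0]))
```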

Efficient parallelizations of orthogonal
triangularization through Givens’ method
of plane rotations are available. 10,11 A
plane rotation replaces two vectors x and
y by

\[
x' = c\,x + s\,y, \qquad y' = -s\,x + c\,y \tag{1}
\]

where c^2 + s^2 = 1, so that c and s may
be interpreted as the cosine and sine,
respectively, of the rotation angle. Fortunately,
trigonometric functions will not
be needed to compute the rotation
parameters.
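
A minimal sketch of how c and s can be obtained without trigonometric functions, assuming real data. The helper names `givens` and `rotate` are hypothetical; the only operations needed are a square root (inside `math.hypot`) and two divisions.

```python
import math


def givens(x1, y1):
    """Return (c, s) with c*c + s*s == 1 such that -s*x1 + c*y1 == 0."""
    r = math.hypot(x1, y1)        # sqrt(x1**2 + y1**2) without overflow
    if r == 0.0:
        return 1.0, 0.0           # nothing to rotate
    return x1 / r, y1 / r


def rotate(c, s, x, y):
    """Apply the plane rotation to vectors x and y, element by element."""
    xr = [c * xi + s * yi for xi, yi in zip(x, y)]
    yr = [-s * xi + c * yi for xi, yi in zip(x, y)]
    return xr, yr


c, s = givens(3.0, 4.0)
print(rotate(c, s, [3.0, 1.0], [4.0, 2.0]))
```

After the rotation the leading element of the second vector is zero to within rounding, and the leading element of the first vector carries the combined magnitude.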

Suppose the vectors x and y are to be
rotated to introduce a zero into the leading
element y_1 of y. Then c and s would be
specified by

\[
c = \frac{x_1}{\sqrt{x_1^2 + y_1^2}}, \qquad s = \frac{y_1}{\sqrt{x_1^2 + y_1^2}} \tag{2}
\]

In the Gentleman-Kung triangular array, 
the rotations are computed by the bound¬ 
ary processors, on the main diagonal, and 
the c and s values are passed to the right. 
Interior cells compute multiplications and 
additions to perform the rotations. 
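
The division of labor in the triangular array can be sketched functionally. The code below is a serial emulation under the assumption of real data and ordinary Givens rotations: each input row is rotated against the stored rows in turn, with the diagonal step standing in for a boundary processor and the remaining updates standing in for interior cells. It does not model timing, and the name `givens_qr` is hypothetical.

```python
import math


def givens_qr(a):
    """Return the n x n upper triangular factor R of an m x n matrix a."""
    n = len(a[0])
    r = [[0.0] * n for _ in range(n)]
    for row in a:                      # rows enter the array one at a time
        x = list(row)
        for i in range(n):
            # Boundary processor i: compute the rotation parameters.
            d = math.hypot(r[i][i], x[i])
            c, s = (1.0, 0.0) if d == 0.0 else (r[i][i] / d, x[i] / d)
            # Cells to the right in row i: multiplications and additions only.
            for j in range(i, n):
                r[i][j], x[j] = c * r[i][j] + s * x[j], -s * r[i][j] + c * x[j]
    return r


for line in givens_qr([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]):
    print(line)
```

Since the rotations are orthogonal, the product of R transposed with R reproduces the Gram matrix of the input, which gives a simple check on the result.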

Iterative matrix 
operations 

Computation of the singular value 
decomposition, or SVD, eigensystems, 
and the generalized singular value decom¬ 
position, or GSVD, is necessarily iterative. 
For the SVD and the Hermitian eigen¬ 
system computations, the fastest numeri¬ 
cally stable serial algorithms, the 
Golub-Reinsch SVD and the QR eigen¬ 
system algorithms, do not lend themselves 
well to parallelization. They have require¬ 
ments for global communication, as well 
as several separate phases to the reduction, 
each requiring a unique architecture. 

We have therefore selected methods 
based on locally computed plane 
rotations—parallel algorithms inspired by 
the Jacobi eigensystem method and the 
Nash-Hestenes SVD algorithm. We 
broadly refer to these as “Jacobi 
methods.” 

Algorithms. The QRD (QR decomposi¬ 
tion), SVD, and GSVD algorithms of F. T. 
Luk 11-13 will be implemented on a new
unified, bitriangular architecture. This 
same architecture can also be used to per¬ 
form matrix multiplication and solution of 
triangular systems. 

We will now describe the QRD and SVD
algorithms through the Luk QRD and
SVD systolic architectures shown in Figure 1.
The GSVD 12 requires the computation
of simultaneous QRDs of two
matrices, say A and B, as in the discussion
of the MUSIC algorithm above, forming
R_A and R_B (upper triangular), followed
by the SVD of R_A R_B^{-1}, though R_B^{-1} and
the product are not found explicitly.

The triangular SVD 
algorithm 

The SVD algorithm 12 described here is 
a Jacobi-like method for upper triangular 
matrices. Since the matrix must be square, 
this algorithm requires an initial QRD, 
which is described in the next section. 

This algorithm begins by computing 
block 2x2 SVDs along the main 
diagonal processors of the systolic array. 
First the blocks are chosen so that the leading
diagonal element of each 2 x 2 block
has an odd index. After the odd blocks are
diagonalized and the updating of the other
elements has proceeded far enough to
avoid data conflicts, the 2 x 2 blocks with
the leading diagonal element of even index
are decomposed into diagonal form. This
is known as the odd-even ordering
scheme 11,14 (see Figure 1). By choosing
outer rotations, 12 together with the odd-even
ordering, the data is shuffled around
and the entire matrix is gradually reduced
to a diagonal matrix, all without
destroying the zeros in the lower triangular
part of the original upper triangular
matrix.
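
The block selection in the odd-even ordering can be listed explicitly. The sketch below only enumerates which adjacent index pairs form 2 x 2 diagonal blocks at each step, using 1-based indices as in the text; it does not perform the rotations, and the name `odd_even_pairs` is hypothetical.

```python
def odd_even_pairs(n, step):
    """Return the (i, i + 1) pairs whose 2 x 2 diagonal blocks are active.

    Even-numbered steps (0, 2, ...) use blocks with an odd leading index;
    odd-numbered steps use blocks with an even leading index.
    """
    start = 1 if step % 2 == 0 else 2
    return [(i, i + 1) for i in range(start, n, 2)]


print(odd_even_pairs(8, 0))
print(odd_even_pairs(8, 1))
```

No index appears in two blocks within the same step, which is why all the blocks of one step can be diagonalized at the same time.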


The QRD algorithm 

The QRD algorithm 12 is a modification 
of the Gentleman-Kung algorithm 10 in 
terms of the time skewing of the input 
data. With this time skewing, one can 
imagine the data input as pairs of columns 
fed into each processor from the top of the 
systolic array, with successive pairs of 
columns delayed one time unit at succes¬ 
sive processors. 

As the data flows through the array, 
each processor operates on a 2 x 2 sub¬ 
matrix; the rotation sines and cosines are 


computed in the boundary processors (the 
processors along the main diagonal) and 
sent to the right, across the array, where 
they are applied to each 2 x 2 submatrix 
at the internal processors. The time skew¬ 
ing ensures that the sine-cosine pair is 
operating on the appropriate two rows of 
the data matrix. 
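
The effect of time skewing on the input stream can be illustrated with a short sketch. For simplicity it delays individual columns by one tick each, rather than the column pairs described above, so it is an approximation of the idea and not the Luk input format. The name `skew_columns` is hypothetical.

```python
def skew_columns(rows):
    """Return the per-tick inputs when column j is delayed by j ticks.

    rows is an m x n matrix in row-major form; None marks an idle input.
    """
    m, n = len(rows), len(rows[0])
    schedule = []
    for t in range(m + n - 1):
        schedule.append(
            [rows[t - j][j] if 0 <= t - j < m else None for j in range(n)]
        )
    return schedule


for tick, inputs in enumerate(skew_columns([[1, 2], [3, 4]])):
    print(tick, inputs)
```

With this delay, the elements of one matrix row reach successive columns of the array on successive ticks, which is what lets rotation parameters computed on the diagonal arrive at a cell together with the data they apply to.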

Both the QRD and SVD algorithms use 
plane rotations to reduce the input 
matrices—to triangular form for the QRD 
and diagonal form for the SVD. This 
means that sine-cosine pairs, for rotations 
similar to Equation (2), must be com¬ 
puted. From Equation (2) it is evident that 
this computation relies heavily on the abil¬ 
ity to compute division and square roots, 
implying that, for real-time applications, 
some hardware provision must be made 
for their speedy computation. 

Extensions to the above algorithms are 
being developed to handle oversized prob¬ 
lems, 15 that is, matrices with a column 
dimension higher than the number of 
columns in the systolic array. These 
algorithms involve some method to fold 
and store portions of the matrix for later 
postprocessing. 


The unified systolic array architecture. 
The Luk QRD and SVD algorithms 
require two slightly different, systolic 
array architectures. As more processors 
are added, to accommodate larger 
matrices, more “SVD processors” are 
required than “QRD processors” to solve 
the same-size problem. The result is that 
some of the “SVD processors” will be idle 
while the QRD algorithm is in progress. 
Also, because of the use of odd-even 
ordering in the SVD algorithm, half the 
processors are idle during odd (even) 
cycles. The unified architecture overcomes 
these inefficiencies by synthesizing the 
QRD and SVD architectures into one 
architecture similar to the Gentleman- 
Kung triangular architecture. 

Figures 1a and 1b show the Luk QRD
and SVD architectures required to decompose
an M x 8 matrix. The QRD array is
composed of 20 processors while the SVD
array has 22 processors. Superimposing
the two arrays, and grouping together two
even and one odd SVD processor with two
QRD processors into one unified processor
on the main diagonal and forming
internal unified processors by grouping
adjacent odd-even pairs of SVD processors
together with two QRD processors,
results in the unified architecture, as portrayed
in Figure 1c. The original Luk SVD
and QRD processors are now virtual cells
of a unified processor; in other words, they
are realized as virtual cells through microcode
programs. This realization through
microcoding of a pipelined arithmetic unit
allows one to use the arithmetic unit efficiently
by keeping the pipeline filled most
of the time.

Figure 2. The bitriangular array configured for (a) the GSVD and (b) matrix multiplication.

Figure 3. The SLAPP unified processor with independent I/O and computational processors.

One obvious advantage of the unified 
architecture is a reduction in the number 
of required processors (from 22 to 10 for 


an M x 8 problem), reducing costs sig¬ 
nificantly. A disadvantage is the reduction 
in throughput of the QRD algorithm, 
compared to a fully parallel implementa¬ 
tion, since two QRD cells reside in a sin¬ 
gle unified processor. However, the 
improved efficiency of the SVD algorithm 
and the cost savings realized from using 
less hardware, per array, will more than 
compensate for this degradation in the 
QRD throughput. 

Another algorithm being implemented 
on the systolic linear algebra parallel 
processor, or SLAPP, is the GSVD, or 
generalized singular value decomposition. 


To accommodate this algorithm, the 
SLAPP will be composed of two triangu¬ 
lar, unified systolic arrays connected point 
to point, forming a bitriangular array, as 
depicted in Figure 2a. With the upper and 
lower boundary processors overlapped, 
thereby forming a rectangular array, this 
architecture can also be configured for 
matrix multiplication (Figure 2b). 

The SLAPP hardware. The first work 
at the Naval Ocean Systems Center 
(NOSC) on systolic arrays was begun in 
late 1979. A project was initiated to build 
a flexible test bed, called the systolic array 
processor (SAP), 16-18 in order to learn
how to design, build, and program systolic 
arrays. The SAP was designed and fabri¬ 
cated in about 18 months. The SAP was 
programmed to demonstrate a beam¬ 
forming application as well as the SVD 
algorithm of Brent and Luk. 19 The SAP 
has subsequently been used for explora¬ 
tion of electronic countermeasure and 
speech compression algorithms. 

Furthermore, the real-time require¬ 
ments of Navy communications systems 
brought about an effort to design a high¬ 
speed SAP (HISAP), 20 which could solve 
a Navy application. 

Considerable effort went into the pro¬ 
gramming of the algorithms on the SAP. 
However, the bulk of the programming 
effort is not in the implementation of the 
algorithm, but in the generation of tools 
to verify the operation of the array. It is the 
“nonrecurring” initial setup effort that 
makes the first algorithm difficult to pro¬ 
gram. Software tools developed and cur¬ 
rently being developed for programming 
HISAP will make the task of program¬ 
ming the SLAPP less arduous. 

There are two separate processors in a 
SLAPP unified processor 21 (Figure 3). 
One processor is used for I/O and the 
other for arithmetic computations. In this 
way the user can separate data movement 
from the computations on the data. 

With this dual-processor architecture a 
processor can be working on one algo¬ 
rithm while data for a second algorithm is 
being received from a neighboring process¬ 
ing element. Also, processing may be data 
dependent, so that individual processors 
will be doing somewhat different opera¬ 
tions on their data than neighboring 
processors. This is usually difficult to per¬ 
form in tightly synchronous systems. 

Simulation of the algorithm dataflow, 
in PC-Matlab, has provided valuable 
information for a more detailed C lan¬ 
guage simulation of the SLAPP hardware. 
The goal of the C simulation will be to
develop exact formulas for the indexing of 
memory addresses within the SLAPP. 
These formulas will be used for gate-level 
simulation on a Daisy workstation before 
being microcoded on the SLAPP. 

The SLAPP project is an effort to
demonstrate the feasibility of 
solving real-time signal- and 
image-processing problems using a two- 
dimensional systolic array that implements 
a fairly wide range of linear algebra 
algorithms. The bitriangular, unified 
architecture is a computationally efficient 
and cost-effective implementation of the 
engagement processor and the Luk QRD, 
SVD, and GSVD algorithms. Modular 
hardware and software tools will alleviate 
much of the programmer’s burden, while
high- and low-level simulation of the hardware
provides a smooth transition to final
hardware design. □
Two major UK systolic
array projects are 
described. The first 
concerns the 
development of a 
wavefront array 
processor for adaptive 
beamforming, the 
second the design of 
novel high-performance 
signal-processing chips. 


For the last five or six years, there
has been an active program of 
research on systolic array proces¬ 
sors (SAPs) in the United Kingdom. In this 
article we describe two key projects 
initiated at the Royal Signals and Radar 
Establishment (RSRE) and carried out in 
collaboration with a number of major 
electronics companies and several univer¬ 
sities. The success of these projects demon¬ 
strates clearly that the systolic approach to 
system design is not just an academic con¬ 
cept but a practical means of exploiting 
large amounts of parallelism and hence 
achieving orders of magnitude improve¬ 
ment in performance for digital signal 
processing (DSP). This type of application 
not only demands but also fully utilizes the 
SAP architecture. 

The first project was aimed at develop¬ 
ing an electronic processor capable of 
computing the vector of complex weights 
required to form the receive beam for an 
adaptive antenna array. 1 This involves the 
solution, in real time, of a set of linear 
equations and clearly requires many 
processors operating in parallel in order to 
achieve the required processing band¬ 
width. The corresponding design problem 
appeared to be almost intractable using the 
type of signal-processing architectures and 
components then available. However, 
when H. T. Kung and C. E. Leiserson first 


published their research on systolic 
arrays, 2 it became apparent that this form 
of dedicated parallel processing architec¬ 
ture offered a realistic chance of provid¬ 
ing a throughput consistent with the 
requirements of real-time military applica¬ 
tions. A program of research on adaptive 
digital beamforming was set up jointly 
between RSRE and STC Technology Ltd. 
(STL) and an adaptive antenna processor 
test bed (the AAPT), based on a triangu¬ 
lar wavefront array processor (WAP), has 
been successfully constructed. The AAPT 
was designed as an experimental tool using 
off-the-shelf hardware and software and 
is capable of processing approximately 6 x 10^4
complex data samples per second. This is not
sufficient for a real-time application but, 
as a natural extension of the project, and 
as a means of providing the throughput 
required for future systems, a high- 
performance node processor chip (the 
NPC) is now being designed at STL. This 
work on adaptive beamforming is 
described in the next section. 

Figure 1. RSRE adaptive antenna: (a) “canonical” adaptive combiner; (b) constraint preprocessor.

The second project concerns the development
of bit-level systolic array processors.
This work was originally motivated
by the fact that the SAP architecture is
ideal for VLSI design and yet, since a complete
multiply and accumulate chip was
required to implement the basic inner
product step processor for most of the
designs proposed in Kung and Leiserson, 2
the integration of a complete SAP was not
feasible. However, subsequent research by 
the authors showed that many important 
digital signal processing (DSP) chips such 
as correlators and digital filters could be 
designed as SAPs in their own right, the 
basic processing element being defined at 
the single-bit level. 3 Since an entire SAP 
of this type may be integrated on a single 
VLSI chip, the inherent advantages of 
regularity and nearest-neighbor 
interconnections may be exploited to ease 
the design and enhance the throughput of 
such devices. These ideas were warmly 
received by various UK electronics compa¬ 
nies and, in particular, by scientists at the 
GEC Hirst Research Centre (HRC), with 
whom a joint research program was estab¬ 
lished to develop some high-performance 
DSP products. Several chips have now 
been successfully fabricated and two of 
these, a high-speed correlator and a rank- 
order filter, are now available commer¬ 
cially through Marconi Electronic Devices 
Ltd. (MEDL), which is also part of 
GEC. 4 The work on bit-level SAPs is dis¬ 
cussed later. 


Adaptive beamforming 

The type of adaptive antenna we have
developed 1 is illustrated schematically in
Figure 1. It comprises a simple preprocessor
and a “canonical” adaptive combiner
that operate on complex, discrete signals
digitized at zero-IF (intermediate frequency).
The inputs to the combiner take
the form of a primary signal y(t) and a set
of N - 1 auxiliary signals x(t). The purpose
of the adaptive combiner is to adjust
the complex weight vector w (and hence
the phase and amplitude of the received
signal in each channel) so as to minimize
the power of the combined output signal

\[
e(t) = \mathbf{x}^{T}(t)\,\mathbf{w} + y(t) \tag{2.1}
\]

This has the effect of creating deep inter¬ 
ference nulls that suppress the (unwanted) 
signals received from any direction other 
than a chosen “look direction” θ (which
defines the corresponding constraint vector
c(θ)). The constraint preprocessor
ensures that any signal arriving from the 
specified look direction is not suppressed, 
the adaptive antenna gain in that direction 
being held at a constant value p. In the fol¬ 
lowing subsections we restrict our atten¬ 
tion to the design and implementation of 
a canonical adaptive combiner suitable for 
real-time application. 

The choice of algorithm and architec¬ 
ture. Since the principal objective of our 
research on adaptive beamforming was to 
develop a processor capable of adapting its 
response characteristics very rapidly, 1 it 
was decided to compute the optimum 
weight vector at each sample time using an 
exact least squares (LS) algorithm and so 
avoid the problems of slow convergence 
associated with gradient techniques. This 
is a nontrivial real-time computation that 


clearly demands the very high processing 
power offered by SAPs. In order to har¬ 
ness this power, however, great care must 
be given to the choice of algorithm and 
array architecture employed. The LS
weight vector at time t_n is given explicitly
by the so-called Wiener-Hopf equation

\[
X^{H}(n)\,X(n)\,\mathbf{w}(n) = -X^{H}(n)\,\mathbf{y}(n) \tag{2.2}
\]

where X(n) denotes the n x (N - 1) matrix
of all data entering the auxiliary channels
up to time t_n, y(n) is the corresponding
n-element vector of data in the primary
channel, and the superscript H denotes
Hermitian conjugation. An obvious (and
commonly used) method for determining
the LS weight vector involves computing
the (N - 1) x (N - 1) covariance matrix
X^H(n) X(n) and solving Equation (2.2)
directly. Since the solution could, in fact,
be computed using some of the earliest 
SAP techniques, this approach looks very 
attractive at first sight. However, it is not 
very suitable from the numerical point of 
view since explicit formation of the covar¬ 
iance matrix involves the squaring of ele¬ 
ments in the underlying data matrix X(n), 
and this causes a significant loss of 
accuracy when using finite precision arith¬ 
metic. Since robust performance under the 
conditions of limited precision arithmetic 
is such an important feature of any signal¬ 
processing algorithm, we decided to per¬ 
form the LS computation using the 
method of QR decomposition (QRD) by 
square-root-free Givens rotations. 5 It is 
not appropriate to describe the algorithm 
in this article except to say that it is known 
to be one of the best available from a
numerical point of view. An extensive 
computer simulation study was carried out 
to determine the minimum word length 
required for the QRD algorithm in a prac¬ 
tical adaptive beamformer. 1 A 24-bit 
floating-point number representation 
(16-bit mantissa and 8-bit exponent) was 
found to give very satisfactory perform¬ 
ance over a wide range of realistic 
scenarios. Over the same range of 
scenarios, the minimum number represen¬ 
tation required for the conventional LS 
solution in Equation (2.2) was found to be 
a 32-bit floating-point number (24-bit 
mantissa and 8-bit exponent). This clearly 
reflects the superior numerical properties 
of the QRD algorithm. The 24-bit floating¬ 
point format, which is suitable for a wide 
range of digital signal processing (DSP) 
applications, was therefore chosen for the 
adaptive antenna processor test bed, or 
AAPT. 
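
The link between squaring the data and needing a longer word can be illustrated with a toy calculation. The sketch below rounds a number to a given count of mantissa bits and shows a small data value that fits in 16 bits while its squared contribution to a covariance entry is lost at 16 bits and retained at 24. It is an illustration only: it does not reproduce the simulation study described above, the rounding helper `quantize` is hypothetical, and the exact bit layout of the AAPT format (sign, hidden bit) is not modeled.

```python
import math


def quantize(x, mantissa_bits):
    """Round x to the nearest value with the given number of mantissa bits."""
    if x == 0.0:
        return 0.0
    m, e = math.frexp(x)               # x == m * 2**e with 0.5 <= |m| < 1
    scale = 1 << mantissa_bits
    return math.ldexp(round(m * scale) / scale, e)


eps = 2.0 ** -9                        # a small data element, exact in 16 bits
entry = 1.0 + eps * eps                # the kind of term a covariance matrix holds

print(quantize(eps, 16) == eps)        # the data value itself survives
print(quantize(entry, 16) == 1.0)      # its square is rounded away at 16 bits
print(quantize(entry, 24) == entry)    # and is kept at 24 bits
```

Working on the data matrix directly only ever has to represent values on the scale of eps, whereas the covariance route has to represent eps squared next to 1.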

For the purposes of real-time DSP, it is 
pointless, of course, to adopt a sophisti¬ 
cated numerical algorithm if it cannot be 
implemented on a sufficiently high speed 
processor. Fortunately, Gentleman and 
Kung 6 have shown how the QRD algo¬ 
rithm described above may be imple¬ 
mented in a very efficient pipelined 
manner using a triangular SAP of the type 
illustrated schematically in Figure 2. The 
operation of this array is described in detail 
in Ward et al., 1 and it is not appropriate 
to repeat that discussion here. Suffice it to 
say that as the data matrix X(n) enters the
triangular array ABC, sample vector by
sample vector, it is reduced in an efficient
recursive manner to the form D^{1/2}(n) R(n)
by a sequence of elementary unitary transformations.
R(n) denotes an (N - 1) x
(N - 1) unit upper triangular matrix whose
off-diagonal elements are stored in the corresponding
“internal” cells and D(n)
denotes an (N - 1) x (N - 1) diagonal
matrix whose elements are stored in the
appropriate “boundary” cells. Applying
the same sequence of unitary transformations
to the data vector y(n) produces an
(N - 1)-element vector u(n), which is stored
in the right-hand column of cells DE. The
LS weight vector at time t_n is then given
by the triangular system of linear
equations

\[
R(n)\,\mathbf{w}(n) + \mathbf{u}(n) = 0 \tag{2.3}
\]

These are readily solved by back-substitution,
and Gentleman and Kung
proposed the use of a separate linear SAP
to perform this operation. 6 However, in
an adaptive beamformer, the essential task
is to compute the LS residual e(t_n) while
the corresponding weight vector is not
required per se. McWhirter 7 has shown
how the residual may be evaluated directly
at each stage of the recursive process without
any need to compute the weight vector
explicitly. The residual signal is, in fact,
computed very simply using the final multiplier
cell in Figure 2. Since the back-substitution
circuit and separate beamforming
network are both eliminated by
this modification, it offers a significant
reduction in hardware complexity. It also
renders the algorithm even more robust
since it avoids solving a set of linear equations
that could, under certain circumstances,
be numerically ill-conditioned.

Figure 2. Systolic/wavefront array for adaptive beamforming.
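
Direct extraction of the residual can be sketched in a few lines. The code below makes several simplifying assumptions that differ from the processor described here: real data, ordinary Givens rotations rather than the square-root-free form, and no forgetting factor. Under those assumptions, a standard result for recursive QRD least squares is that the current residual equals the product of the rotation cosines multiplied by the fully rotated primary sample, so a single multiplication replaces back-substitution. The name `qrd_residuals` is hypothetical.

```python
import math


def qrd_residuals(xs, ys):
    """Return the least squares residual at each sample time.

    xs: list of auxiliary-channel sample vectors; ys: primary-channel samples.
    No weight vector is ever formed.
    """
    p = len(xs[0])
    r = [[0.0] * p for _ in range(p)]   # contents of the triangular array
    u = [0.0] * p                       # contents of the right-hand column
    residuals = []
    for x_row, y in zip(xs, ys):
        x = list(x_row)
        gamma = 1.0                     # running product of cosines
        for i in range(p):
            d = math.hypot(r[i][i], x[i])
            c, s = (1.0, 0.0) if d == 0.0 else (r[i][i] / d, x[i] / d)
            for j in range(i, p):
                r[i][j], x[j] = c * r[i][j] + s * x[j], -s * r[i][j] + c * x[j]
            u[i], y = c * u[i] + s * y, -s * u[i] + c * y
            gamma *= c
        residuals.append(gamma * y)     # the final multiplication
    return residuals


print(qrd_residuals([[1.0, 2.0], [3.0, 1.0], [2.0, 2.0], [1.0, -1.0]],
                    [1.0, 2.0, 0.0, 3.0]))
```

While there are no more samples than unknowns the fit is exact and the residual is zero; after that, each output matches the residual that would be obtained by solving for the weights and applying them to the newest sample.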

In effect, the SAP in Figure 2 constitutes
an extremely sophisticated digital filter
that acts like a funnel taking in several
regular data streams and producing one
regular stream of processed data. Shepherd
and McWhirter 8 have recently
shown how it may, in fact, be used without
modification to solve a general LS
problem with multiple linear constraints,
and so it constitutes a particularly important
digital signal processing architecture
that has been referred to on several occasions
as the “jewel in the crown” of SAPs.
The fact that an advanced numerical algorithm
has been incorporated into such a
simple and regular processing structure
highlights the benefit of developing
algorithms and architectures together.

Figure 3. Schematic of STL-RSRE adaptive antenna processor test bed (AAPT). (Reproduced by permission of STC Technology Ltd.)

The processor network in Figure 2 may 
also be implemented as a wavefront array 
processor, or WAP, of the type proposed 
by Kung et al. 9 The overall function of the 
WAP is identical to that of its systolic 
counterpart, the only difference being that 
the individual node processors do not 
operate synchronously. Instead, the oper¬ 
ation of each processor is controlled 
locally and depends on the necessary input 
data being available and its previous out¬ 
puts having been accepted by the appropri¬ 
ate neighboring processors. As a result, it 
is not necessary to impose a temporal skew 
on data input to the triangular WAP; the 
associated processing wavefront develops 
naturally within the array. An important 
advantage of the WAP architecture is that 


it avoids the problems associated with 
high-speed clock distribution. Further¬ 
more, using an array of transputer emula¬ 
tors programmed in Occam, Broomhead 
et al. 10 have demonstrated that in situa¬ 
tions where the processing time associated 
with each node is data dependent, a WAP 
can actually achieve higher throughput 
than its systolic counterpart. It was there¬ 
fore decided to implement the adaptive 
antenna processor test bed, or AAPT, as 
a WAP. In order to operate in this mode, 
every processing element must, of course, 
incorporate some additional circuitry to 
implement a bidirectional handshake on 
each of its input/output links and thereby 
ensure that the necessary communication 
protocol is observed. This represents an 
overhead that is not negligible but was eas¬ 
ily absorbed within each node of the 
AAPT. 
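
The throughput argument for data-dependent node times can be illustrated with a toy timing model. The sketch below compares a clocked pipeline, whose tick must cover the slowest case, with a self-timed one in which a node starts as soon as its input has arrived and it has finished its previous item. It assumes a linear chain of nodes rather than a triangular array, ignores handshake overhead, and assumes a node's output is always accepted; the function names and the timing figures are made up for illustration.

```python
def systolic_time(times):
    """Total time when every tick lasts as long as the slowest operation.

    times[k][i] is the processing time of item k at node i.
    """
    items, nodes = len(times), len(times[0])
    tick = max(max(row) for row in times)
    return (items + nodes - 1) * tick


def wavefront_time(times):
    """Total time when each node fires as soon as its data is available."""
    items, nodes = len(times), len(times[0])
    finish = [[0.0] * nodes for _ in range(items)]
    for k in range(items):
        for i in range(nodes):
            data_ready = finish[k][i - 1] if i > 0 else 0.0   # from previous node
            node_free = finish[k - 1][i] if k > 0 else 0.0    # previous item done
            finish[k][i] = max(data_ready, node_free) + times[k][i]
    return finish[-1][-1]


sample = [[1, 3], [3, 1], [1, 1]]
print(systolic_time(sample), wavefront_time(sample))
```

In this model the self-timed pipeline can never be slower than the clocked one, and it is faster whenever the slow operations do not all line up on the same path through the array.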

The STC Technology Ltd.-Royal Sig¬ 
nals and Radar Establishment adaptive 
antenna processor test bed. An AAPT 
incorporating the type of triangular WAP 
in Figure 2 has recently been constructed 
by a team of engineers at STC Technology 
Ltd., or STL, working in collaboration 


with scientists at the Royal Signals and 
Radar Establishment, or RSRE. 11 The 
main purpose of the project was to demon¬ 
strate this novel approach to adaptive 
beamforming in a real system environ¬ 
ment. As illustrated in Figure 3, the AAPT 
comprises two main subsystems: 

(1) the antenna unit and RF/IF (radio fre¬ 
quency/intermediate frequency) 
front-end subsystem, comprising a six- 
element antenna array operating at 
1290 MHz and a six-channel modular 
RF/IF receiver, and 

(2) the digital processor subsystem com¬ 
prising zero-IF receivers and analog- 
to-digital conversion, the adaptive 
beamforming array, a system timing 
and control processor interface, and 
dual output digital-to-analog con¬ 
verters. 

The processor subsystem can accept up 
to six inputs in either analog or digital 
form. The zero IF boards (one per chan¬ 
nel) operate at an IF of 184 MHz and have 
a bandwidth of 26 kHz with a gain of 
about 40 dB. The maximum sampling fre¬ 
quency of the A/D converters is 200K sam¬ 
ples per second and the digital data is 
output as in-phase and quadrature channels
in 12-bit two’s-complement format.
Alternatively, digital baseband signals can 
be input directly to the processor array. 

The adaptive beamformer comprises 33 
identical processing nodes, 21 of which 
constitute the type of triangular wavefront 
array processor illustrated in Figure 2 (the 
TWAP). The remaining 12 processors per¬ 
form a data correction and constraint 
application function and are labeled as the 
DCWAP (data correction wavefront 
processor) in Figure la. Two processing 
nodes have been fitted onto one extended 
printed circuit board measuring 10 x 9 
inches (Figure 4) and 17 such boards are 
used to accommodate the 33 active proces¬ 
sors. Ten other cards provide the data 
acquisition, system timing, and interface 
control functions. The processor sub¬ 
system (Figure 5) is housed in a 4-ft rack¬ 
ing enclosure along with power supplies 
and fan cooling. The total power con¬ 
sumed by this unit is approximately 500 

It was decided to base the processing 
node design on a standard programmable 
DSP chip in order to reduce the system 
development cost and provide flexibility in 
the choice of test-bed algorithm for a range 
of applications. The Texas Instruments 
TMS32010 was chosen since the chips and 
the software development system were 
both readily available at the time. As illus¬ 
trated in Figure 6, additional hardware is 
used to provide multiple 16-bit I/O ports 
with input FIFO buffering, a lookup table 
(for reciprocals, etc.), floating-point 
renormalization logic, localized control, 
and a bit-serial diagnostic data link. The 
node design also incorporates program 
memory ROM (containing “fixed” boot¬ 
strap and node algorithm code) and pro¬ 
gram memory RAM, which allows one to 
download additional node programs and 
thereby investigate the performance of 
other algorithms. 

The processor subsystem is controlled 
via a terminal from which the operating 
mode can be set up and selected. The link 
to an external computer allows programs 
to be downloaded to the array nodes (each 
of which has its own unique address) and 
the node contents to be dumped back to 
the host during software development. 
The final operational code for each node 
can be programmed into on-board ROM, 
whereupon the array may operate in an 
entirely stand-alone manner. 

Basing the adaptive antenna processor
test bed, or AAPT, development on a
WAP greatly simplified the overall system
design in several ways. For example, it was
necessary to design and test only one
processor board for both the TWAP and
DCWAP. Furthermore, since the
sequence and timing of operations for each
processor is controlled autonomously, the
global system control is trivial compared
with that which would be required for a
more conventional design. Avoiding the
problems of high-speed clock distribution
simplified the electrical design and
contributed significantly to the fact that the
hardware (which sustains a processing rate
of ≈150 × 10^6 TMS32010 instruction
cycles per second) functioned perfectly
first time and gives bit-for-bit agreement
with a software emulator developed on the
VAX. With an asynchronous system, of
course, great care must be taken to ensure
that the communication links do not fail
because of the problem of metastable
states. In this respect, the AAPT design
relies entirely on the asynchronous
characteristics of the TMS32010 and, so far, no
problems have been experienced in
practice.

Figure 4. Node processor board for the AAPT. (Reproduced by permission of STC
Technology Ltd.)

Figure 5. STL-RSRE adaptive antenna processor test bed. (Reproduced by
permission of STC Technology Ltd.)

Figure 6. AAPT processing node configuration. (Reproduced by permission of STC Technology Ltd.)

Figure 7. Experimental results for multiple jammer suppression: (a) signal power spectrum before adaptation; (b) signal power
spectrum after adaptation. (Reproduced by permission of STC Technology Ltd.)

Like most programmable DSP chips,
the TMS32010 uses a 16-bit fixed-point
number representation and so the
floating-point operations required for the
test-bed array have to be programmed in
software. This consumes a surprisingly
large number of machine cycles and places
a significant restriction on the throughput
rate achievable. For example, ≈150 cycles
are required to implement one complex
24-bit floating-point multiplication,
although the exact number is, of course,
data dependent. Given the 200-ns
instruction cycle for the TMS32010, the
corresponding computation time is ≈30 µs.
Since the internal node of the TWAP must
perform two complex multiply/accumulate
operations per sample time, and this
constitutes the main processing bottleneck,
the AAPT can process ≈10^4 complex
samples per second from each of the six
receiver channels. Although this is too slow
for most real-time applications, it is more
than sufficient to carry out a wide range of
laboratory experiments and trials.
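
The throughput figures above follow from simple arithmetic, which the short Python sketch below reproduces. The cycle count, the cycle time, and the two multiply/accumulate operations per sample are taken from the text; the function names are illustrative only, and the estimate ignores accumulation, I/O, and control overhead, so it gives an upper bound of the same order as the quoted ≈10^4 samples per second rather than the exact figure.

```python
# Figures quoted in the text for the AAPT processing node.
CYCLE_TIME_S = 200e-9           # TMS32010 instruction cycle
CYCLES_PER_COMPLEX_MULT = 150   # approximate and data dependent
MACS_PER_SAMPLE = 2             # internal TWAP node, per sample time


def complex_multiply_time(cycles=CYCLES_PER_COMPLEX_MULT,
                          cycle_time=CYCLE_TIME_S):
    """Time in seconds for one software floating-point complex multiply."""
    return cycles * cycle_time


def sample_rate_upper_bound(macs=MACS_PER_SAMPLE):
    """Samples per second if only the complex multiplies cost time."""
    return 1.0 / (macs * complex_multiply_time())


if __name__ == "__main__":
    print(f"complex multiply: {complex_multiply_time() * 1e6:.0f} us")
    print(f"upper bound: {sample_rate_upper_bound():.0f} samples/s")
```

The bound comes out between 10^4 and 2 × 10^4 samples per second, which is consistent with the order of magnitude given in the text once the remaining per-sample work is allowed for.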

Tests to date have included a systematic 
evaluation of the cancellation perform¬ 
ance achieved using IF sources in the 
laboratory together with initial perform¬ 
ance assessments in an anechoic chamber. 
The cancellation performance against 
multiple signals is well illustrated by the 
example in Figure 7. The top illustration 
shows the combined signal spectrum 
before adaptation and indicates the indi¬ 
vidual signal and jammer power levels. 
The lower illustration shows the combined 
signal spectrum after adaptation subject to 
a constraint that maintains the gain in a 
wanted signal direction. It can be seen that
all unwanted signals have been suppressed
below the level of the desired signal, which
has not been affected. This represents an
improvement in output signal-to-noise ratio
of approximately 33 dB. The AAPT
constitutes an invaluable research tool that can
demonstrate the performance achievable 
with a wide range of advanced adaptive 
algorithms. It is believed that this hard¬ 
ware represents the first working demon¬ 
stration of a WAP anywhere in the world 
and that its success will lead to the use of 
WAPs for solving other computationally 
intensive problems. 

Recent advances in chip technology 
have seen the introduction of commercial 
floating-point DSP microprocessors such
as the AT&T WE DSP32 or the INMOS
IMS T800 Transputer. Although these 
could now enhance the AAPT through¬ 
put, they were not available at the time of 
design and are still unable to provide suffi¬ 
cient processing power for most practical 
applications. A radical advance in node 
processor design is required to meet the 
performance requirements of future sys¬ 
tems and such a processor concept will 
now be described. 

The STL node processor chip. Consider
a six-channel adaptive beamformer
operating, for example, at a data rate of 10
MHz per channel. For such an application,
the TWAP described previously
would need to achieve more than 3
gigaflops at low power and in a small
package. In order to achieve this objective
(typical of future radar and communications
systems), each node processor within
the AAPT would have to achieve a
throughput rate of ≈150 megaflops and,
since one complex multiply requires six
real operations, this represents an increase
in performance of ≈10^3. A
high-performance node processor chip (NPC),
aimed at satisfying such requirements, is
currently being designed at STL 11 as part
of a Ministry of Defence technology
demonstrator program. It is intended
primarily as a programmable building
block for the implementation of real-time
SAPs or WAPs and will be fabricated in
1.25-µm CMOS using a process developed
under the UK Alvey information technology
program. First samples should be
available in 1988. The chip architecture
has been optimized for fast, floating-point
arithmetic with multiple I/O facilities that
will allow efficient NPC arrays to be
developed for a wide range of front-end DSP
tasks such as adaptive filtering and digital
beamforming where the computational
load is very high.
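
The performance targets in this paragraph can be checked with a few lines of Python. This is only a sketch: the 21-node TWAP, the 150 megaflops per node, the six real operations per complex multiply, and the 30-µs software multiply time all come from the text, whereas the constant and function names are made up for illustration.

```python
# Figures quoted in the text.
TWAP_NODES = 21                  # nodes in the triangular array
TARGET_NODE_FLOPS = 150e6        # required rate per node
REAL_OPS_PER_COMPLEX_MULT = 6    # real operations in one complex multiply
AAPT_COMPLEX_MULT_TIME_S = 30e-6 # software multiply on the TMS32010


def array_flops(nodes=TWAP_NODES, node_flops=TARGET_NODE_FLOPS):
    """Aggregate floating-point rate of the whole array."""
    return nodes * node_flops


def aapt_node_flops():
    """Equivalent floating-point rate of one existing AAPT node."""
    return REAL_OPS_PER_COMPLEX_MULT / AAPT_COMPLEX_MULT_TIME_S


def required_speedup():
    """Ratio of the target node rate to the existing node rate."""
    return TARGET_NODE_FLOPS / aapt_node_flops()


if __name__ == "__main__":
    print(f"array total: {array_flops() / 1e9:.2f} gigaflops")
    print(f"AAPT node:   {aapt_node_flops() / 1e6:.2f} megaflops")
    print(f"speedup:     {required_speedup():.0f}x")
```

The array total exceeds 3 gigaflops and the speedup is several hundred, which matches the "more than 3 gigaflops" and "≈10^3" figures to the order of magnitude.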

In order to maximize the processing 
throughput for complex arithmetic, the 
chip design includes two floating-point 
multipliers (incorporating parallel mul¬ 
tiplier arrays and exponent-handling logic) 
and two floating-point arithmetic units for 
add, subtract, and format conversion. A 
24-bit floating-point number format 
(16-bit mantissa and 8-bit exponent) is 
used throughout. The reduced
“word-length” (relative to the IEEE standard)
allows faster computation (i.e., fewer gate
delays) and more compact realization in
VLSI. However, it also provides sufficient
resolution with wide dynamic range to
satisfy the most demanding algorithms
envisaged (of which adaptive beamforming
is typical). In order to simplify the
implementation of algorithms on the 
NPC, a single-cycle instruction format has 
been defined for a simple set of arithmetic 
commands. These correspond to the 
necessary floating-point multiply, add, 
and subtract operations, in addition to for¬ 
mat conversion (from fixed to floating¬ 
point and vice versa). An external lookup 
table can be used to store reciprocals 
(primarily for division), square root, etc. 
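
To give a feel for the resolution and dynamic range of a 16-bit mantissa with an 8-bit exponent, the following Python sketch quantizes a value to such a format. The article does not specify the encoding, so everything beyond the two field widths is an assumption made for illustration: a two's-complement fractional mantissa, a two's-complement exponent with no bias, and round-to-nearest.

```python
import math

MANT_BITS = 16   # mantissa width from the text
EXP_BITS = 8     # exponent width from the text


def encode(value):
    """Return (mantissa, exponent) as signed integers.

    Assumed layout (not from the article): value is approximated by
    mantissa * 2**(exponent - (MANT_BITS - 1)).
    """
    if value == 0.0:
        return 0, 0
    frac, exp = math.frexp(value)            # 0.5 <= |frac| < 1
    mant = round(frac * (1 << (MANT_BITS - 1)))
    if mant == (1 << (MANT_BITS - 1)):       # rounded up to +1.0
        mant >>= 1                           # renormalize
        exp += 1
    exp_min = -(1 << (EXP_BITS - 1))
    exp_max = (1 << (EXP_BITS - 1)) - 1
    if not exp_min <= exp <= exp_max:
        raise OverflowError("exponent does not fit in 8 bits")
    return mant, exp


def decode(mant, exp):
    """Inverse of encode under the same assumed layout."""
    return math.ldexp(mant, exp - (MANT_BITS - 1))


if __name__ == "__main__":
    for v in (1.0, math.pi, -1e-20, 1e30):
        m, e = encode(v)
        print(v, (m, e), decode(m, e))
```

Under these assumptions the relative rounding error stays below 2^-15, and magnitudes from roughly 2^-128 up to 2^127 can be represented.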

Multiple data paths and a novel switch¬ 
ing arrangement allow data to be trans¬ 
ferred rapidly between functional blocks 
in the node design. In addition, three 
24-bit parallel input ports and three simi¬ 
lar output ports enable data to be supplied 
to and extracted from the NPC at a rate 
commensurate with the internal process¬ 
ing capability. These I/O facilities are 
sufficiently comprehensive to accommo¬ 
date many types of SAP or WAP architec¬ 
ture for high-performance DSP. Dataflow 
within any NPC array can be regulated by 
bidirectional handshaking associated with 
each data path. As discussed previously, 
this approach resolves many of the timing 
and clock distribution problems often 
encountered with synchronous designs, 
but great care must be taken, once again, 
to ensure that the probability of failure due 
to metastable states is very low. The NPC 
is designed to have a mean time between 
failures of the order of several years or more, and
this is felt to be adequate for any realistic 
signal-processing application. Temporary 
data storage has also been included in the 
form of FIFO buffers at all input and out¬ 
put ports of the NPC. This feature enables 
the processes of node computation and 
node-to-node data transfer to be decou¬ 
pled since the processors can deplete or fill 
their FIFO buffers and continue comput¬ 
ing even when their neighbors have halted 
temporarily. As a result, the I/O opera¬ 
tions can be allowed to consume a much 
greater portion of the total node program 
cycle than in the case of a fully syn¬ 
chronous design. The node chip will, how¬ 
ever, be capable of operating under the 
control of a globally distributed clock if 
synchronous operation is preferred. 
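
The decoupling effect of the FIFO buffers can be illustrated with a toy cycle-by-cycle model in Python. It is an idealized sketch, not a model of the NPC handshake logic: one producer node and one consumer node are each active in bursts (the activity patterns and all names are invented for the example), and the only parameter varied is the FIFO depth between them.

```python
from collections import deque


def words_transferred(depth, producer_active, consumer_active, cycles):
    """Count words delivered through a FIFO of the given depth.

    producer_active / consumer_active are repeating 0/1 patterns saying
    whether each node is able to write or read in a given cycle.
    """
    fifo = deque()
    transferred = 0
    for t in range(cycles):
        # The consumer reads first if it is running and data is waiting.
        if consumer_active[t % len(consumer_active)] and fifo:
            fifo.popleft()
            transferred += 1
        # The producer writes if it is running and there is room.
        if producer_active[t % len(producer_active)] and len(fifo) < depth:
            fifo.append(t)
    return transferred


# Hypothetical bursty nodes: each runs for four cycles, then halts for four,
# and the two are exactly out of step with one another.
BURSTY_PRODUCER = [1, 1, 1, 1, 0, 0, 0, 0]
BURSTY_CONSUMER = [0, 0, 0, 0, 1, 1, 1, 1]

if __name__ == "__main__":
    for depth in (1, 2, 4):
        n = words_transferred(depth, BURSTY_PRODUCER, BURSTY_CONSUMER, 80)
        print(f"depth {depth}: {n} words in 80 cycles")
```

With a single-word buffer the producer stalls for most of its active burst, whereas a four-word FIFO absorbs the whole burst and the nodes never wait for each other.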

The NPC design includes 64 words of
writable microprogram control store and
three independent blocks of 64-word data
RAM. It has a degree of programmability
that allows many DSP algorithms to be
implemented efficiently. Typical
applications include recursive filtering, fast
Fourier transforms (FFTs), adaptive data
correction, and recursive least squares
processing. In general, the algorithms
required may be implemented using a
program sequence of less than 10 instruction
cycles per NPC. For example, the
“internal” node algorithm for an adaptive
beamformer can be executed efficiently
using a sequence of six pipelined
instruction cycles and the radix-2 FFT butterfly
using only three. The total program
memory available in each node will permit
several such sequences (or macrofunctions)
to be stored simultaneously and
selected by external control. Alternatively,
each NPC could be used to implement one
more complicated DSP subroutine such as
an 8-point FFT.

Figure 8. Architecture of bit-level systolic correlator.

Compared with currently available DSP 
chips, the NPC will have much less pro¬ 
gram and data memory but a much more 
powerful arithmetic capability and multi¬ 
ple parallel I/O ports. As a result, each 
NPC will have a processing capability of 
several tens of megaflops. It is not 
designed to implement complete 
algorithms on an entire array of data but 
to perform a relatively simple function at 
one of the nodes in an array of processors. 
In effect, the NPC is intended to provide 
for parallel DSP what the INMOS Transputer
provides for general parallel computing.
Since DSP architectures tend to be
comparatively simple and regular, it will 
be less sophisticated than the Transputer 
in terms of software control but much 
more powerful in terms of its parallel com¬ 
munication and high-speed floating-point 
arithmetic capabilities. 


Bit-level systolic arrays 

In this section, we describe some of the 
UK work on bit-level SAPs. The main 
thrust of this activity has been to develop 
a state-of-the-art design capability for UK 
industry to exploit in the development of 
new DSP systems. A major part of the 
program has been the development 
(mainly at GEC/MEDL) of a range of new 
VLSI signal-processing chips that exploit 
the SAP concept at bit level. Bit-level SAP 
chips have now been developed for a num¬ 
ber of specific DSP operations, including 
convolution and correlation, computation 
of the discrete Fourier transform (DFT), 
and rank-order filtering. Some of these 
will now be described. 


Convolution and correlation. The jth
output of an N-point convolution or
correlation process may be written as

$$y_j = \sum_{i=0}^{N-1} a_i x_{j \pm i} \qquad (3.1)$$

where $a_0, a_1, \ldots, a_{N-1}$ represent a fixed
set of coefficients and $x_i$ ($i = 0, 1, 2, \ldots$)
represents a sequence of input data (signal)
values. In Equation (3.1) the minus sign
applies to convolution whereas the plus
sign corresponds to correlation. If $a_i$ and
$x_j$ are both n-bit binary numbers, then the
pth bit of the output word $y_j$ may be
expressed in the form

$$y_j^p = \sum_{i=0}^{N-1} \sum_{q=0}^{n-1} a_i^q x_{j \pm i}^{p-q} + \text{carries} \qquad (3.2)$$



This computation is usually carried out by
performing the inner summation for each
value of i (i.e., performing the
multiplications explicitly) and accumulating the
results. Most of the conventional DSP
systems based on ripple-through multipliers
operate in this manner. This approach is
particularly easy to understand and can be
pipelined right down to the bit level. 3
However, the computation in Equation
(3.1) can be performed in other ways, and
these give rise to alternative VLSI chip
architectures that have been adopted for
many of the bit-level SAP circuits
developed to date. For example, if we expand
Equation (3.1) in terms of individual bits
in the coefficient $a_i$, it takes the form

$$y_j = \sum_{q=0}^{n-1} 2^q \sum_{i=0}^{N-1} a_i^q x_{j \pm i} \qquad (3.3)$$

It follows that an alternative approach to
implementing convolution or correlation
is to develop a processor in which the
multibit words $x_j$ are multiplied only by
single-bit coefficients. The full result may
then be obtained using n of these bit-slice
processors in parallel. A bit-level systolic
array for implementing the bit-slice
function is illustrated in Figure 8. It comprises
an orthogonally connected array of gated
full adder cells, each column being
associated with an individual (single bit)
coefficient. Bit-parallel data words enter
the array from the left in a bit-staggered
manner and move to the right. The result
words move in the opposite direction. 3

Figure 9. Bit-level systolic correlator chip. (Reproduced by permission of the GEC
Hirst Research Centre and Marconi Electronic Devices Ltd.)
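
The bit-slice decomposition can be checked numerically. The Python sketch below compares a direct evaluation of the correlation sum with a sum of n single-bit-coefficient correlations weighted by powers of two. It assumes unsigned n-bit coefficients for simplicity (signed arithmetic would need extra handling of the sign bit), and the function names and test data are invented for the example.

```python
def correlate(a, x, j):
    """Direct correlation: y_j = sum over i of a_i * x_{j+i}."""
    return sum(a_i * x[j + i] for i, a_i in enumerate(a))


def bit_slice(a, q):
    """Single-bit coefficients: bit q of every coefficient."""
    return [(a_i >> q) & 1 for a_i in a]


def correlate_bit_sliced(a, x, j, n):
    """n single-bit correlations, each weighted by 2**q, then summed."""
    return sum((1 << q) * correlate(bit_slice(a, q), x, j)
               for q in range(n))


if __name__ == "__main__":
    coeffs = [3, 0, 5, 7]                 # unsigned 3-bit coefficients
    data = [1, -2, 4, 0, 3, -1, 2]        # multibit data words
    for j in range(len(data) - len(coeffs) + 1):
        print(j, correlate(coeffs, data, j),
              correlate_bit_sliced(coeffs, data, j, n=3))
```

Each call to `correlate` with a bit slice corresponds to the work of one bit-slice processor; the outer weighted sum is what combining n such processors in parallel achieves.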

A bit-slice digital correlator chip based 
directly on this circuit architecture was 
designed at the GEC Hirst Research 
Centre 12 and has been developed into a
commercial product by MEDL. 4 A pho¬ 
tograph of the chip is shown in Figure 9. 
It constitutes a 64-stage correlator/con¬ 
volver (i.e., N = 64) in which the reference 
coefficients take the value 1 or 0. The 
coefficient may be updated in real time 
since new values can be input to the circuit 
on every clock cycle. The chip has been 
designed to handle 4-bit input data 
although the words are sign-extended to 10 
bits within the device in order to accom¬ 
modate the maximum range of any result. 
The circuit therefore requires a 64 × 10
array of cells that have, in fact, been laid
out as two separate 32 × 10 blocks. The
chip has a wide range of applications, 
including time-delay measurements, finite 
impulse response (FIR) filtering, pattern 
matching, adaptive filtering, beamform¬ 
ing, pulse compression, and spread spec¬ 
trum communication. It was initially 
designed with the general military radar 
and communications market in mind and 
was therefore fabricated in 3-µm
CMOS/SOS, a major advantage of this
technology being its radiation hardness.
The chip incorporates approximately 
43,000 transistors and occupies an area of 
7 mm x 7 mm. It can be clocked at data 
rates up to 35 MHz and consumes less than 
250 mW at 20-MHz, 5-volt operation. The 
50 percent duty cycle implicit in the archi¬ 
tecture of Figure 8 is avoided in practice by 
using half-latches and clocking adjacent 
cells on alternate phases of the clock. An
important feature of the chip is that it 
incorporates all circuitry required to for¬ 
mat the data (e.g., skew the input and skew 
the output). Additional circuitry is also 
provided so that several chips may be cas¬ 
caded to increase the number of correla¬ 
tion stages, the input data word length, or 
the number of coefficient bits without any 
additional logic (or “glue” chips). 
Extremely powerful systems can therefore 
be constructed as highly regular arrays of 
bit-level systolic components. 
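
The 10-bit internal word length quoted for the correlator follows from the worst-case result range, as the following Python sketch shows. It assumes the 4-bit input data are two's-complement values (the text says only that they are sign-extended) and uses the 64 stages and 1-or-0 coefficients given above; the helper names are illustrative.

```python
def result_range(stages, data_bits):
    """Extreme sums when every single-bit coefficient is 1."""
    lo = -(1 << (data_bits - 1)) * stages
    hi = ((1 << (data_bits - 1)) - 1) * stages
    return lo, hi


def bits_needed(lo, hi):
    """Smallest two's-complement width whose range covers [lo, hi]."""
    width = 1
    while not (-(1 << (width - 1)) <= lo and hi <= (1 << (width - 1)) - 1):
        width += 1
    return width


if __name__ == "__main__":
    lo, hi = result_range(stages=64, data_bits=4)
    print(f"result range: {lo} .. {hi}")
    print(f"word length:  {bits_needed(lo, hi)} bits")
```

The most negative result, 64 × (-8) = -512, is exactly the lower limit of a 10-bit two's-complement word, so 10 bits is the minimum that accommodates every possible output.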

Another approach to convolution or
correlation is to sum the terms in Equation
(3.2) in the opposite order to that which is
normally adopted; that is, for a fixed value
of q perform the summation

$$y_j^{p,q} = \sum_{i=0}^{N-1} a_i^q x_{j \pm i}^{p-q} \qquad (3.4)$$

The pth bit of the result can then be
formed by carrying out the final sum over
q. This approach has been used in the
design of a bit-level systolic
convolver/correlator with multibit input data
and coefficients. The architecture of this
device, 13 which is illustrated in Figure 10,
has gone through several design phases in
order to produce a circuit both simple and
100 percent efficient in terms of cell
utilization. It incorporates the suggestion of
Urquhart and Wood 14 that the coefficients
remain on fixed sites, with each cell in the
rth row storing one bit of the coefficient
$a_{r-1}$. The circuit also features
unidirectional dataflow, which is extremely
important since it permits the use of extra delays
when driving signals from chip to chip or
bypassing a faulty row of cells if a
fault-tolerant design is required. 13 Data words
enter this array through the top right-hand
cell (least significant bit first) and are
clocked from right to left. Once a given bit
has traversed any row, it is delayed for
several cycles (in this case three) before
being input to the row below. As they
move through the array, these data bits
interact with the coefficient bits on each
cell to form partial products of the form
$a_i^q x_{j \pm i}^{p-q}$. Each partial product is passed
to the cell below on the next clock cycle.
The net result is that the sum over i of all
partial products in Equation (3.4) is
formed within a parallelogram-shaped
interaction region that moves down
through the array. The final result is
formed by summing over q (i.e.,
accumulating all partial products of the
same significance within the interaction
region) using two extra rows of cells at the
bottom of the main array (not shown in
Figure 10). The design of a commercial
convolver chip based on this architecture
has recently been carried out at MEDL.
This particular device, which should be
available toward the end of 1987, can be
configured internally to operate as a
16-stage FIR filter with 16-bit coefficients
and 16-bit input data or as two separate
16-stage filters with 8-bit data and
coefficients that can be cascaded internally to
form one 32-stage device. It comprises a 16
× 16 array of cells and can be clocked at
more than 20 MHz with a power
requirement similar to that of the correlator chip
(i.e., less than 0.5 W). Since the chip
operates in a bit-serial manner, the output rate
depends on the input word length, a
typical example being 1 megaword per second
output for 16-bit input data. The chip can
also be configured to perform inner
product computations and so it may be used as
a building block for matrix × vector or
matrix × matrix multiplication circuits
and also in various pattern-matching
applications. 14

Figure 10. Architecture of systolic multibit convolver.
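
The reordered summation can also be verified at bit level. In the Python sketch below, single-bit partial products are first summed over i for fixed p and q, then all sums of equal significance are accumulated over q, and finally each column is weighted by 2^p, which is the arithmetic equivalent of propagating the carries. Unsigned n-bit data and coefficients are assumed to keep the example short; the names and test values are invented.

```python
def bit(value, k):
    """Bit k of a non-negative integer; zero for negative k."""
    return (value >> k) & 1 if k >= 0 else 0


def partial_sum(a, x, j, p, q):
    """Sum over i of a_i^q * x_{j+i}^{p-q} for fixed p and q."""
    return sum(bit(a_i, q) * bit(x[j + i], p - q)
               for i, a_i in enumerate(a))


def correlate_from_bits(a, x, j, n):
    """Rebuild y_j from the single-bit partial products."""
    total = 0
    for p in range(2 * n - 1):                   # every significance level
        column = sum(partial_sum(a, x, j, p, q)  # final sum over q
                     for q in range(n))
        total += column << p                     # weighting absorbs carries
    return total


def correlate(a, x, j):
    """Direct word-level correlation for comparison."""
    return sum(a_i * x[j + i] for i, a_i in enumerate(a))


if __name__ == "__main__":
    coeffs = [3, 0, 5, 7]            # unsigned 3-bit coefficients
    data = [1, 2, 4, 0, 3, 6, 2]     # unsigned 3-bit data
    for j in range(len(data) - len(coeffs) + 1):
        print(j, correlate(coeffs, data, j),
              correlate_from_bits(coeffs, data, j, n=3))
```

The inner `partial_sum` is what the interaction region forms as it moves down the array, and the `column` accumulation plays the role of the extra rows of cells at the bottom.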

Discrete Fourier transform. A highly
regular bit-level SAP for computing the
DFT was proposed by Ward et al. 15 It
exploits Winograd’s Algorithm, which
allows the DFT ($Y_0, Y_1, \ldots, Y_{N-1}$) of a
sequence of N data points
($x_0, x_1, \ldots, x_{N-1}$) to be written in the form

$$Y = C(Ax \otimes Bz) \qquad (3.5)$$

where A is an M × N matrix and C is an
N × M matrix. The product Bz is
precalculated and given as a set of M
coefficients. These coefficients are either purely
real or purely imaginary, never complex
numbers. The symbol $\otimes$ represents
pointwise multiplication of the two vectors, the
M individual products corresponding to
the reduced number of multiplications in
Winograd’s Algorithm. In almost all cases
of interest, the A and C matrices in
Winograd’s DFT algorithm contain
elements that only take the values 0 or ±1. As
a result, the computation in Equation (3.5)
requires only multiplication of a data
vector x by a matrix of elements 0 or ±1
followed by a pointwise multiplication and
the multiplication by another matrix
whose elements are either 0 or ±1.
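
As a concrete instance of this structure, the Python sketch below writes the 3-point DFT as a 0/±1 matrix product, a pointwise multiplication by three precalculated coefficients, and a second 0/±1 matrix product, and compares the result with a direct DFT. The matrices are the usual small 3-point Winograd module written out for illustration (here M = N = 3); they are not taken from the article, which concerns an 8-point design.

```python
import cmath
import math

U = 2 * math.pi / 3

# Pre-addition matrix, entries 0 or +/-1.
A = [[1, 1, 1],
     [0, 1, 1],
     [0, 1, -1]]

# Precalculated coefficients: each is purely real or purely imaginary.
BZ = [1.0, math.cos(U) - 1.0, -1j * math.sin(U)]

# Post-addition matrix, entries 0 or +/-1.
C = [[1, 0, 0],
     [1, 1, 1],
     [1, 1, -1]]


def matvec(matrix, vector):
    """Plain matrix-by-vector product."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def winograd_dft3(x):
    """Additions, three pointwise multiplies, then additions again."""
    s = matvec(A, x)
    m = [b * s_k for b, s_k in zip(BZ, s)]
    return matvec(C, m)


def direct_dft(x):
    """Reference DFT with the exp(-2*pi*i*j*k/N) convention."""
    n = len(x)
    return [sum(x[k] * cmath.exp(-2j * math.pi * j * k / n)
                for k in range(n))
            for j in range(n)]


if __name__ == "__main__":
    x = [1 + 2j, -3, 0.5j]
    for w, d in zip(winograd_dft3(x), direct_dft(x)):
        print(w, d)
```

Only the middle step involves true multiplications; the two matrix products need nothing more than additions and subtractions, which is what makes them suitable for a regular array of simple cells.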

A bit-level SAP, which can be configured
to implement the A and C matrix
× vector multiplication for an 8-point
transform, has recently been designed by
Ferranti Computer Systems Ltd based on
the bit-serial, word-parallel architecture
outlined in Figure 11. It is estimated that
this device will operate at clock rates in
excess of 20 MHz and consume less than
1 W using the Ferranti differential logic
bipolar technology. It comprises an 8 × 8
array of cells and uses the equivalent of
5000 gates. The chip has been configured
as a highly regular building block for use
in the construction of larger transform
circuits, as described in Ward et al. 15 An
8-point Winograd DFT based on this
design would require six chips (assuming
that one chip is needed for each set of M
multipliers). At 20 MHz it is capable of
processing 8.4 × 10^6 complex words per
second, thereby computing one 8-point
transform every 0.95 µs.

Rank-order filter. All the bit-level SAPs
described above were designed to perform
linear algebra computations. However,
bit-level SAPs have also been used to
implement nonlinear functions, good
examples being the median filter and
rank-order filter (ROF) circuits developed at the
GEC Hirst Research Centre. 4 The ROF,
which is a generalization of the median
filter, has properties that render it
particularly useful for image processing. It
operates as follows. A one-dimensional
window W of length k is applied to a
stream of pixels as illustrated in Figure 12.
As the window moves along, every pixel
appears once in each of the k different
window positions and its rank at each stage is
defined by ordering all pixels within the
window into a sequence of descending
intensity. This process can be carried out
recursively by determining the rank $r_{ii}$ of
pixel $x_i$ in the first window $W_i$ (rank
evaluation) and modifying it as
appropriate in the subsequent windows $W_{i+1}$ to
$W_{i+k-1}$ (rank modification). Whichever
pixel has the required rank at any stage is
transferred to the output stream (rank
extraction). As described in more detail
elsewhere, 4 these operations may be
performed using separate SAPs that can be
combined to form a completely pipelined
system, as shown in Figure 13. Cells 1, 2,
and 3 in this diagram correspond to the
basic operations of rank evaluation,
modification, and extraction, respectively, and
define the operation of the array at word
level. Operation of the circuit at bit level
may be deduced from the word level cell
functions given in Figure 13 by
representing each of the input pixel values as a
(staggered) bit-parallel number. The rank
evaluation array, for example, compares [...]

[...] dimensional ROF of variable window
length (programmable up to 15 pixels), is
designed to handle 10-bit (unsigned
magnitude) input and output data, and can
achieve a maximum throughput rate of 20
megasamples per second. The rank of the
filter can also be programmed externally,
and this allows it to function, for example,
as a minimum, maximum, or median
filter. The device may also be used to
implement the type of separable
two-dimensional filter required for many
image-processing applications. In this
case, the one-dimensional filter is first
applied to each row of the image and then
to each of the resulting columns. The ROF
chip has a wide range of potential
applications. These include real-time image
processing for video communications
systems, computer vision and robotics,
constant false alarm rate (CFAR) radar
detection, and filtering for digital AM
radio receivers.

Figure 11. Architecture of Winograd matrix × vector multiplier.

Figure 12. A one-dimensional pixel window of length five.

Figure 13. Architecture of a one-dimensional systolic rank-order filter.

Figure 14. Systolic rank-order filter chip. (Reproduced by permission of the GEC
Hirst Research Centre and Marconi Electronic Devices Ltd.)
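
The word-level behavior of a rank-order filter can be expressed in a few lines of Python. The sketch below is a functional reference only: it sorts each window outright instead of using the recursive rank evaluation, modification, and extraction steps of the systolic design, and it assumes that rank 1 denotes the brightest pixel, in line with the descending ordering described above. The function names are illustrative.

```python
def rank_order_filter(pixels, k, rank):
    """Output, for each window of length k, the pixel of the given rank.

    Rank 1 is the largest value in the window, rank k the smallest.
    """
    if not 1 <= rank <= k:
        raise ValueError("rank must lie between 1 and k")
    out = []
    for start in range(len(pixels) - k + 1):
        window = sorted(pixels[start:start + k], reverse=True)
        out.append(window[rank - 1])
    return out


def separable_rof_2d(image, k, rank):
    """Apply the 1-D filter to each row, then to each resulting column."""
    rows = [rank_order_filter(row, k, rank) for row in image]
    cols = [rank_order_filter(list(col), k, rank) for col in zip(*rows)]
    return [list(row) for row in zip(*cols)]


if __name__ == "__main__":
    stream = [3, 9, 1, 7, 5, 2, 8]
    print("maximum:", rank_order_filter(stream, k=5, rank=1))
    print("median: ", rank_order_filter(stream, k=5, rank=3))
    print("minimum:", rank_order_filter(stream, k=5, rank=5))
```

Changing only the `rank` argument switches the same routine between maximum, median, and minimum filtering, which mirrors the externally programmable rank of the chip.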

Discussion. It has been possible in this
article to describe only a few of the DSP
operations that may be implemented as
bit-level SAPs. There are numerous other
applications, including, for example, the
discrete cosine transform, the
Walsh-Hadamard transform, IIR filtering,
and vector quantization.

The devices fabricated to date provide 
hard evidence that the bit-level SAP archi¬ 
tecture has a number of important advan¬ 
tages for high-performance VLSI chip 
design. Since the arrays are completely 
regular, the task of designing an entire 
VLSI chip is essentially reduced to 
optimizing the design and layout of one or
two very simple cells. It is also very easy to
carry out a functional validation of the 
array using a hardware description language
such as ELLA or VHDL, fewer than
100 lines of ELLA code being sufficient, 
for example, to describe the convolution 
and correlation circuits described earlier. 
As a result, the typical design time for a 
bit-level SAP is very much less than that 
required for a random logic circuit with the 
same number of transistors. It is important 
to realize that bit-level SAPs are not just 
regular in the geometric sense; they also 
exhibit complete electrical regularity by 
virtue of the fact that each cell is connected 
only to its nearest neighbors. The cor¬ 
responding absence of long interconnects 
also enhances the circuit performance by 
reducing parasitic capacitance (and hence 
RC time delays) and leads to extremely 
high circuit packing densities. For exam¬ 
ple, the correlator chip shown in Figure 9 
has four times as many transistors per unit 
area as most conventional circuits fabri¬ 
cated to date using the same process. 

Since the circuits described above have
been pipelined right down to the bit level,
their throughput rates depend only on the 
propagation delay through a single cell 
(typically of order three to four gate 
delays). As a result, we have been able to 
achieve, with a relatively conservative 
CMOS process, data rates comparable to 
those normally associated with bipolar 
technology but at much reduced power 
consumption and higher levels of
integration. In fact, we have been able to achieve
functional throughput rates of 5 × 10^11
to 10^12 gate-Hz/cm^2 with a 3.0-µm
process, and this exceeds the phase 1 VHSIC
target for 1.25-µm devices. Although most
of the emphasis in this and previous sec¬ 
tions has been on pipelining circuits right 
down at the single bit level, it should be 
noted that for some circuit technologies 
where the propagation delay and physical 
size of a latch is comparable with that of 
the processing element, it may be better to 
pipeline the circuits at a higher level. In 
such cases, the basic cell would be designed 
as the logical equivalent of a 2 × 1, 2 ×
2, or larger array of single-bit cells but
without any internal latches. It is possible 
in this way to retain the VLSI design 
advantages of a bit-level SAP without 
compromising the cell design in terms of 
the optimum area, speed, and power con¬ 
sumption for a given technology. Also, in 
situations where input values such as the 
coefficients for a convolution or correla¬ 
tion are expected to remain constant for 
many clock cycles, it may be better to 
broadcast them across the array and thus 
reduce the number of latches. 
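
The functional throughput rate (gates × clock rate per unit area) can be estimated from the correlator chip figures quoted earlier: about 43,000 transistors, a 7 mm × 7 mm die, and a 35-MHz maximum clock. The Python sketch below does this under one assumption that is not in the article, namely four transistors per equivalent gate; the article also does not say which chip its quoted range was derived from, so this is a plausibility check only.

```python
def functional_throughput_rate(transistors, clock_hz, area_cm2,
                               transistors_per_gate=4):
    """Gate-Hz per square centimeter.

    transistors_per_gate is an assumed conversion factor, chosen here
    to match a simple two-input CMOS gate.
    """
    gates = transistors / transistors_per_gate
    return gates * clock_hz / area_cm2


if __name__ == "__main__":
    ftr = functional_throughput_rate(transistors=43_000,
                                     clock_hz=35e6,
                                     area_cm2=0.7 * 0.7)
    print(f"{ftr:.2e} gate-Hz/cm^2")
```

With these inputs the estimate falls inside the 5 × 10^11 to 10^12 range stated in the text; a different transistors-per-gate factor would scale the result proportionally.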

Although the SAP approach to
chip design eliminates most of 
the global connections within a 
circuit, it has been argued that many of the 
problems associated with the interconnect 
lines have simply been transferred to the 
clock. This may be true to some extent. 
However, since the problem has been cen¬ 
tralized, the circuit designer can concen¬ 
trate his attention on the problems of clock 
distribution, confident that the function of 
the array is otherwise determined by that 
of a single cell. With a more conventional 
circuit architecture it is essential to mini¬ 
mize clock skew over the entire chip. In a 
fully systolic design, where processors 
communicate only with their nearest 
neighbors, it is only the incremental skew 
between processors that must be 
minimized, and this can be kept small 
through careful design and by running all 
clock lines in metal. In most of the chips 
designed to date, the master clock is bussed 
from an input pad along the bottom or top
of the chip, the clock signal to each row or 
column being driven by its own clock 
buffer. Using this type of scheme, clock 
skews of less than 1 ns have been achieved 
across entire rows or columns of the 
CMOS/SOS chips described above. 12 

Some doubt has been expressed con¬ 
cerning the integration of bit-level SAP 
devices into systems that also use more 
conventionally designed components. To 
date, we have not found this to be a prob¬ 
lem. Careful design of chips such as the 
correlator (Figure 9) has rendered them 
very easy to use. For example, skewing and 
deskewing latches (which represent a very 
small overhead) have been incorporated 
on chip together with any extra circuitry 
required to build multichip systems. It is, 
of course, essential in any design to take 
account of the latency associated with an 
internally pipelined component. MEDL’s 
recently announced “signal stream” chip 
family, 4 designed to provide many of the 
components required for high-speed DSP, 
includes the bit-level systolic correlator, 
the ROF, and several nonsystolic chips 
such as a video line buffer, cascade ALU, 
and a one- or two-dimensional convolver. 
These systolic and nonsystolic elements 
may be used together to generate com¬ 
pound high-performance systems without 
the need for any “glue” chips.
Jacob A. Abraham, Prithviraj Banerjee, Chien-Yi Chen, W. Kent Fuchs, Sy-Yen Kuo,
and A. L. Narasimha Reddy 
University of Illinois 


This article describes various techniques
for fault tolerance that can be applied to
systolic array architectures. The approach
of algorithm-based fault tolerance is shown
to be the natural one for such systems.


Digital systems that are operated
in applications where there is a 
high cost of failure require high 
reliability and continuous operation. Since 
it is impossible to guarantee that portions 
of a system will never fail, such systems 
need to be designed to tolerate failures of 
the system components. The discipline of 
fault-tolerant computing is, therefore, one 
which has attracted a great deal of research 
interest. Researchers have attempted to 
derive highly effective and, at the same 
time, efficient techniques to tolerate 
failures in complex digital systems. 

The high computation needs of many 
applications can now be met through the 
use of highly parallel special-purpose sys¬ 
tems that can be produced very cost effec¬ 
tively through the use of very large scale 
integration (VLSI) technology. Systolic 
arrays, such as the ESL systolic array 1 
and the Carnegie Mellon Warp proces¬ 
sor, 2 are examples of such systems. This 
article deals with techniques that can be 
used to achieve fault tolerance in such sys¬ 
tolic arrays. It will provide an overview of 
classical fault tolerance techniques, discuss 
their applicability to this particular prob¬ 
lem, and discuss new, very efficient, fault 
tolerance techniques ideally suited for such 
highly parallel systems. 

A system is said to have failed when it 
no longer provides the service for which it 
was designed. 3 The manifestation of a 


failure will be the errors produced by the 
system. The cause of an error is denoted as 
a fault within the system. Faults may be 
permanent or transient; a permanent fault, 
of course, will not necessarily produce 
errors for all system inputs or states. Fault- 
tolerant computing thus deals with tech¬ 
niques designed to prevent errors at the 
output of a system. Faults may be inher¬ 
ent in the specification or design of a sys¬ 
tem, or may have physical causes, being 
introduced during manufacture of the
system or due to wear-out in the field.
Tolerance of design and specification
faults is an important area of study, and
the problems of preventing faults in
hardware and tolerating faults in
software have been
studied. This area is, however, beyond the 
scope of this article, which will consider 
only the problem of tolerating physical 
failures. 

Any fault tolerance technique is 
designed to tolerate a given class of faults 
within a system. A fault can be treated at 
any level within the system—from a very 
low level, such as a transistor level, to a 
higher, module level. Most fault tolerance 
techniques have been designed to tolerate 
faults in some module within a system. 
Such a module-level fault model is ideal 
for VLSI, where a physical failure can 
cause some portion of a chip to be faulty. 
A module fault is assumed to result in
arbitrary errors at the output of the
module, and it is these errors that must
be prevented from appearing at the end of
the system computation. 

Fault tolerance involves the following 
steps: 

(1) detection of an error (due to a fault) 
at some module output; 

(2) correction of the error; 

(3) identification of the faulty module; 
and 

(4) reconfiguration of the system to 
bypass the faulty module. 

The detection and correction can be 
achieved in many different ways. For 
example, checks on the computation can 
detect an error, and the computation can 
be rolled back to a previous state in order 
to correct the error. This is done in com¬ 
mercial systems such as the AT&T ESS 
and Tandem. On the other hand, if there 
is sufficient redundancy in the computa¬ 
tion, an error due to a module failure can 
be masked by the correct values from other 
modules; this technique is used in highly 
redundant systems for space applications. 
Reconfiguration, of course, requires that 
the faulty module be identified first. In this 
article, we will initially address techniques 
to detect and correct errors and to identify 
faulty modules; the reconfiguration prob¬ 
lem will be discussed at the end of the 
article. 

Classical fault tolerance techniques. 

Many techniques have been proposed over 
the years to achieve fault tolerance. These 
have usually been general techniques that 
can be applied at the module level in a sys¬ 
tem. The generality makes them applica¬ 
ble to most problems, but they also suffer 
from the requirement of a large amount of 
overhead. This section will outline the var¬ 
ious classical techniques and examine their 
applicability to highly parallel computing 
structures. 

Masking redundancy. In this approach,
which is now known as N-tuple modular
redundancy, N copies (N odd) of a module
and a majority voter are used to mask
the errors from failed modules. This
technique can be combined with a set of
spares through the use of a disagreement
detector and a switching unit to produce
what is known as a hybrid redundant system. 

At least three modules are necessary in 
a voting system (typically called triple 
modular redundancy). The overhead for 
fault tolerance is, thus, at least 200 per¬ 
cent, without counting the cost of the 
voter, which can be quite complex. The 


technique, however, is very general and 
can be applied at any level in a highly par¬ 
allel system. In a multiple processing unit 
systolic array, for example, the technique 
can be applied by triplicating each of the 
systolic cells and having a voter at the out¬ 
put of the cells, or by triplicating the entire 
array with one voter. The cost of the fault- 
tolerant system is very high in both cases. 
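
The voting step can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the cell function (3x + 1) and the size of the injected error are hypothetical, and a real voter would be a hardware circuit rather than software.

```python
from collections import Counter


def majority_vote(outputs):
    """Return the value produced by a strict majority of the modules."""
    value, count = Counter(outputs).most_common(1)[0]
    if count * 2 <= len(outputs):
        raise ValueError("no majority: too many modules disagree")
    return value


def run_nmr(modules, x):
    """Apply every copy of the module to x and vote on the results."""
    return majority_vote([module(x) for module in modules])


def good_cell(x):
    return 3 * x + 1          # hypothetical systolic cell function


def faulty_cell(x):
    return 3 * x + 1 + 8      # same cell with an arbitrary output error


# Triple modular redundancy: one failed copy is masked by the other two.
result = run_nmr([good_cell, faulty_cell, good_cell], 5)
```

The sketch also makes the cost visible: three evaluations and a voter are needed for every result, which is the overhead of at least 200 percent noted above.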

Concurrent error detection (CED). A 
lower cost fault tolerance technique is to 
design the system to produce an indication 
of errors in the computation during nor¬ 
mal operation. This must, of course, be 
followed by further steps that will identify 
the faulty unit and also provide correction 
of the errors. Techniques utilizing redun¬ 
dancy in space, as well as in time, have 
been proposed and they will be briefly 
described below. Further along in the arti¬ 
cle, we show that the combination of space 
and time redundancy can lead to a very 
attractive form of fault tolerance. 

The earliest form of the error-checking 
technique was the use of error-detecting 
and correcting codes. It is quite a difficult 
problem to incorporate error detection 
and correction capabilities into general 
functional modules. This is because sim¬ 
ple parity-based encodings are not 
preserved under computations such as 
arithmetic operations. The methodology 
of using self-checking circuits for on-line 
error detection in computers was first pro¬ 
posed by Carter and Schneider, and has 
led to a large amount of research on the 
topic. 4 

Many of the classical techniques may be 
readily applied to systolic array architec¬ 
tures. However, the overhead in hardware 
or in performance may be too high for 
many of these techniques to be used in 
practice. The busing structures and mem¬ 
ories are best protected through coding 
techniques. The systolic array processor, 
with which we are concerned in this arti¬ 
cle, can be made fault tolerant, with low 
overhead, using some new techniques 
described in the following sections. 


Error detection and 
correction using time 
redundancy 

The simplest type of self-checking sys¬ 
tem is one which involves duplication of a 
functional module. Clearly, any error due 
to failure in one module will be detected by 
an equality checker. The overhead 


required by self-checking systems using 
hardware redundancy and duplication is 
over 100 percent for error detection, and 
further hardware and time overhead is 
necessary for error correction. Custom- 
tailored self-checking designs can have 
lower overhead. In some cases, if perform¬ 
ance is not the major bottleneck, time 
redundancy can be used instead of hard¬ 
ware redundancy in order to achieve on¬ 
line error detection and correction with 
very low overhead. An example of a time 
redundancy technique for arithmetic oper¬ 
ations is called recomputing with shifted 
operands. 5 In this technique, the arith¬ 
metic operation is repeated after shifting 
the operands so that each cell in the arith¬ 
metic unit operates on a different set of 
bits. It can be shown that, with appropri¬ 
ate shifts and design of the arithmetic unit, 
a faulty cell will cause the two results to be 
different, achieving error detection. The 
overhead needed is only that of the shifters 
and the equality checker. Error correction 
can be achieved with this scheme by using 
further cycles. Related techniques using 
time redundancy include triple-time 
redundancy and alternating logic; how¬ 
ever, the latter technique does require rela¬ 
tively high hardware overhead. 
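
A minimal sketch of recomputing with shifted operands follows. It assumes a deliberately simple fault model that is not taken from the cited work: the sum output of one bit slice of the adder is stuck at 0, and the operands are small enough that the shifted sum still fits in the adder. The names WIDTH, faulty_add, and reso_add are hypothetical.

```python
WIDTH = 8  # hypothetical adder width, in bit slices


def faulty_add(a, b, bad_cell=None):
    """Add on a WIDTH-bit adder whose bit slice bad_cell (if given)
    has its sum output stuck at 0."""
    total = (a + b) & ((1 << WIDTH) - 1)
    if bad_cell is not None:
        total &= ~(1 << bad_cell)
    return total


def reso_add(a, b, bad_cell=None):
    """Return (first_result, error_detected).

    The second pass shifts both operands left by one bit, so every
    result bit is produced by a different cell, and then shifts the
    sum back before comparing.
    """
    first = faulty_add(a, b, bad_cell)
    second = faulty_add(a << 1, b << 1, bad_cell) >> 1
    return first, first != second


fault_free = reso_add(5, 6)            # both passes agree
with_fault = reso_add(5, 6, bad_cell=1)  # the two passes disagree
```

The only extra hardware modeled here is the shift and the comparison; the price is the second pass through the adder.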

Several techniques have been proposed 
in the literature for using time redundancy 
to achieve error detection with low hard¬ 
ware overhead. In the design of recon- 
figurable systolic arrays having cell bypass 
mechanisms (using extra links), a tech¬ 
nique for error detection using duplication 
and comparison has been proposed. 6 The 
drawback of this technique is the relatively 
high hardware overhead. 

Another technique is based on the fact
that some systolic arrays will have the
computational activity (or wavefront) in
one cell at a given time instant. This
computation will take place in cell i
from time t = i - 1 to t = i; there is,
hence, an inherent redundancy in the
computations. This is exploited for
concurrent error detection in the
technique of comparison with concurrent
redundant computation. 7 Two computations
are launched in a way that they are
performed on different regions of the
array. Figure 1 shows the implementation
with one extra cell; this will achieve
the same input in cells i and i - 1 in a
time period. The outputs of the two cells
(which perform the redundant computation)
can be compared using the logic shown in
Figure 2.

A technique for using the idle process¬ 
ing element in a systolic array to achieve 
error correction has also been proposed. 8 

Figure 1. Concurrent redundant computation design. 


This technique uses approximately 3/2 w 
cells (where w is the number of cells in the 
nonredundant system) and produces three 
copies of the computed results. These 
results can be used to correct the errors and 
also identify the faulty unit. 

Although time redundancy is a power¬ 
ful technique that, in general, requires only 
a small amount of hardware overhead, it 
has a fundamental limitation when applied 
to high-performance systolic arrays. Many 
of these systems are designed for high- 
throughput signal-processing applica¬ 
tions, and the 100 percent degradation in 
throughput required by the time redun¬ 
dancy techniques may adversely affect the 
system performance. The techniques 
described in the following section can be 
used to achieve error detection rapidly 
using a very small amount of hardware 
overhead, and the error correction can be 
done using time redundancy. This balance 
of space, hardware, and time redundancy 
achieves the best possible utilization of the 
system modules, since normal perform¬ 
ance is not sacrificed for error detection 
and the overhead for error correction is 
needed only after an error has occurred, a 
relatively rare event in practice. 


Algorithm-based error 
detection and fault 
location 

In the context of systolic processing sub¬ 
systems targeted to perform special- 
purpose computations, it is often possible 
to derive some special techniques for 
achieving error detection or fault tolerance 
at low cost. Algorithm-based fault toler¬ 
ance is a novel technique pioneered by the 
researchers at the University of Illinois to 
deal with precisely those issues: designing 
schemes for fault tolerance specific to the 
algorithms being executed. 

Figure 2. Logic in concurrent redundant computation. 


Conventional data encoding is done at 
the word level in order to protect against 
errors affecting bits in a word. Since a 
faulty module could affect all the bits of 
a word it is operating on, we need to 
encode the data at a higher level. This can 
be done by considering the set of input 
data to the algorithm and encoding this 
set. The original algorithm must then be 
redesigned to operate on this encoded data 
and to produce encoded output data. The 
redundancy in the encoding would enable 
the correct data to be recovered or, at least, 
to recognize that the data are erroneous. 

Here, we briefly review some of the 
algorithm-based error and fault detection 
schemes that are appropriate for systolic 
arrays. 

Matrix-matrix multiplication. We first
motivate the application of an
algorithm-based checking technique by an
example: the multiplication of two N × N
matrices. In the checksum encoding, an
extra row and an extra column are
appended to the original matrix. These
are the sums of the elements of the
columns and rows, respectively. 9 After
the matrix-matrix multiplication is
performed, the result matrix also
preserves the checksum property. If there
is an error in the result matrix element
(i, j), it will be identified by
verifying the equality of the sum of the
row elements with the checksum for row i,
and by verifying the equality of the sum
of the column elements with the checksum
for column j. Once the erroneous element
is identified, the correct element can be
reconstructed by taking the sum of all
elements of that row (column) except the
erroneous element, and subtracting this
sum from the row (column) checksum. This
is illustrated in Figure 3, which shows a
5 × 4 row checksum encoded matrix
multiplied by a 4 × 5 column checksum
encoded matrix on a 5 × 5 processor array
having row and column broadcasting
capability. The product is a 5 × 5 full
checksum matrix, which is used to detect
and correct errors during matrix
multiplication on an orthogonal array.

Figure 3. Checksum matrix multiplication on an orthogonal array. 
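
The encode, multiply, check, and correct cycle can be sketched as follows. This is an illustrative software model, not the array implementation: it uses integer data so that the equality tests are exact, it assumes that at most one data element of the product is wrong, and all function names are hypothetical.

```python
def matmul(a, b):
    """Plain matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def with_checksum_row(a):
    """Append a row holding the sum of each column of a."""
    return [row[:] for row in a] + [[sum(col) for col in zip(*a)]]


def with_checksum_column(b):
    """Append to every row of b the sum of that row."""
    return [row + [sum(row)] for row in b]


def locate_and_correct(c):
    """Check a full checksum matrix c and repair one data element in place.

    Returns the (row, column) that was corrected, or None if every
    checksum agrees. Assumes any error is confined to a single data
    element (not a checksum element).
    """
    n = len(c) - 1        # number of data rows
    m = len(c[0]) - 1     # number of data columns
    bad_rows = [i for i in range(n) if sum(c[i][:m]) != c[i][m]]
    bad_cols = [j for j in range(m)
                if sum(c[i][j] for i in range(n)) != c[n][j]]
    if not bad_rows and not bad_cols:
        return None
    if len(bad_rows) != 1 or len(bad_cols) != 1:
        raise ValueError("error pattern is outside the single-element model")
    i, j = bad_rows[0], bad_cols[0]
    others = sum(c[i][:m]) - c[i][j]
    c[i][j] = c[i][m] - others        # rebuild from the row checksum
    return i, j


a = [[1, 2, 0, 1], [0, 1, 3, 2], [2, 0, 1, 1], [1, 1, 1, 0]]
b = [[2, 0, 1, 1], [1, 1, 0, 2], [0, 3, 1, 0], [1, 0, 2, 1]]
c = matmul(with_checksum_row(a), with_checksum_column(b))  # 5 x 5 result
c[1][2] += 7                                               # injected error
location = locate_and_correct(c)
```

With floating-point data the exact comparisons would have to be replaced by a tolerance, a point this sketch sidesteps by using integers.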

Matrix-vector multiplication. The mul¬ 
tiplication of a matrix by a vector can be 
performed on a linear array of processors. 
Jou and Abraham proposed a weighted 
checksum encoding scheme 10 to detect 
and locate a faulty processor concurrently. 
Given a vector $(a_1, a_2, \ldots, a_n)$,
two check elements WCS1 and WCS2 are
appended to the vector, where

$$WCS1 = \sum_{i=1}^{n} a_i \quad\text{and}\quad WCS2 = \sum_{i=1}^{n} a_i \, 2^{i-1}$$

The code can be extended to matrices by 
forming the code for each row and column 
of the matrix. After the matrix-vector mul¬ 
tiplication is performed, the result vector 
also preserves the weighted checksum 
property used to correct errors. 
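
The way the two check elements locate a fault can be sketched as follows. The sketch assumes integer data, a single erroneous data element in the result, and that element i of the result is produced by processor i of the linear array; the function names are hypothetical. If element k (counting from 1) is wrong by e, the plain checksum is off by e and the weighted checksum by e times 2^(k-1), so their ratio identifies k.

```python
def encode(matrix):
    """Append the plain and the 2**(i-1)-weighted checksum rows."""
    columns = list(zip(*matrix))
    wcs1 = [sum(col) for col in columns]
    # enumerate() counts from 0, so 2 ** i is the weight 2**(i-1)
    # of the text's 1-based row index.
    wcs2 = [sum(v * 2 ** i for i, v in enumerate(col)) for col in columns]
    return [row[:] for row in matrix] + [wcs1, wcs2]


def matvec(matrix, x):
    """Matrix-vector product of a list of rows with a vector."""
    return [sum(a * b for a, b in zip(row, x)) for row in matrix]


def locate_error(y):
    """Return (index, corrected_value) for one bad data element, else None.

    y holds the data elements followed by WCS1 and WCS2.
    """
    *data, wcs1, wcs2 = y
    s1 = sum(data) - wcs1
    s2 = sum(v * 2 ** i for i, v in enumerate(data)) - wcs2
    if s1 == 0 and s2 == 0:
        return None
    if s1 == 0 or s2 % s1 != 0 or s2 // s1 <= 0:
        raise ValueError("error is not confined to one data element")
    ratio = s2 // s1
    k = ratio.bit_length() - 1
    if ratio != 1 << k or k >= len(data):
        raise ValueError("error is not confined to one data element")
    return k, data[k] - s1


m = [[1, 2], [3, 4], [5, 6]]
y = matvec(encode(m), [1, 1])   # result vector with its two check elements
y[1] += 5                       # error injected by "processor" 1 (0-based)
found = locate_error(y)
```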

Matrix equation solvers. A matrix equa¬ 
tion of the form Ax = b can be solved by 


factoring the matrix A into a product of a 
lower and an upper triangular matrix L 
and U. Improvements of the checksum 
schemes have been developed by Luk 11 
for solving such a system of equations with 
a full-weighted checksum matrix. The use 
of unique weights $w_j = j$ was proposed
to reduce numerical difficulties with
large weights. A full checksum matrix is
factored with the assumption that there
is one error of size $\alpha$ in the
(i, j) position of A, which can be
represented as
$\tilde{A} = A - \alpha e_i e_j^T$,
where $e_i$ represents a vector of all
zeros except for a one in the ith
position. The LU factorization of this
matrix can be written as
$\tilde{A} = \tilde{L}\tilde{U}$. The
correct LU factorization can be obtained
from $A = \tilde{L}(\tilde{U} + p e_j^T)$,
where p satisfies $\tilde{L}p = \alpha e_i$.
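
The correction follows from one line of algebra, sketched below. The tilde notation is an assumption made here to distinguish the erroneous matrix and its computed factors from the correct matrix A; $\alpha$ is the size of the error and $e_i$, $e_j$ are unit vectors.

```latex
\begin{aligned}
\tilde{A} &= A - \alpha\, e_i e_j^{T} = \tilde{L}\tilde{U}, \\
A &= \tilde{A} + \alpha\, e_i e_j^{T}
   = \tilde{L}\tilde{U} + (\tilde{L}p)\, e_j^{T}
   = \tilde{L}\,(\tilde{U} + p\, e_j^{T}),
   \qquad \text{where } \tilde{L}p = \alpha\, e_i .
\end{aligned}
```

Because $\tilde{L}$ is lower triangular, p is obtained by a single forward substitution rather than by repeating the factorization.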

Concurrent error detection using linear 
property encoding. Concurrent error 
detection (CED) can be easily incorpo¬ 
rated into systolic arrays that satisfy the 
following two conditions (these have been 
found to be features common to many 
numeric systolic arrays): 12 

(1) Each processing element (PE) in the 
repetitive part of the systolic array is itself 
a linear system. 

(2) The coefficient stream passes 
through the array without being modified. 


An example of an infinite impulse 
response (IIR) filter is given as follows. A 
systolic array for an IIR filter defined by 
the difference equation 


$$y(k) = \sum_{m} x(k - m)\, b(m) + \sum_{m} y(k - m)\, a(m)$$


is shown in Figure 4a, and the function of 
each PE is shown in Figure 4b. This design 
was proposed by S. Y. Kung. 13 He gives 
detailed procedures for transforming a sig¬ 
nal flow graph into a systolic array in his 
article. (Note that there are many other IIR 
filter implementation methods. This 
scheme is also valid for these methods.) 
For the sake of simplicity, in Figure 4 the 
input and output buffers inside each PE 
are not shown. In this example, since
each PE simply performs some linear
combination, it is easy to show that it
is a linear system, where the coefficient
for each PE is l. Moreover, the
coefficient stream (i.e., the l stream)
flows through the array from left
to right without being modified. 

To perform the CED, two new variables 
are added: 

$$IS' = IS + m + n \quad\text{and}\quad OS' = OS + m' + n'$$

where IS' and OS' are the updated values
of IS and OS at each PE, and the IS and
OS that enter $PE_1$ are both 0. (See
Figure 4c.) Finally, at the right end of
the array, we use an extra PE (i.e.,
$PE_{N+1}$ in Figure 4a) to compute

$$IS + l \left( \sum_{i=1}^{N} a_i + \sum_{i=1}^{N} b_i \right)$$

and compare it with the OS entering
$PE_{N+1}$. Any inconsistency will reveal
that there exists a faulty PE. The reason for 
adding IS and OS as we did above is 
explained below. 

Consider $l_p$, which is a coefficient of
the l coefficient stream. It is clear
that $l_p$ passes through the array
without being modified. Notice that
during the propagation $l_p$ meets
$(m_k, n_k, a_k, b_k)$ to generate
$(m'_k, n'_k)$ at $PE_k$. Here, $m_k$ and
$n_k$ are used to represent the m-value
and n-value that $l_p$ meets at $PE_k$;
$m'_k$ and $n'_k$ are used to represent
the m'-value and n'-value generated at
$PE_k$ (when $l_p$ is at $PE_k$). It can
be shown that when $l_p$ arrives at the
extra PE (at the right end of the array),
the IS and OS entering the extra PE are

$$IS = \sum_{i=1}^{N} (m_i + n_i)$$

$$OS = \sum_{i=1}^{N} (m'_i + n'_i) = \sum_{i=1}^{N} \left[ m_i + n_i + l_p (a_i + b_i) \right] = IS + l_p \left[ \sum_{i=1}^{N} (a_i + b_i) \right]$$





Thus, in the extra PE, we need only
compare

$$IS + l_p \left( \sum_{i=1}^{N} a_i + \sum_{i=1}^{N} b_i \right)$$

with OS to detect the errors. Since each
$a_i$ or $b_i$ is a system parameter that
is precalculated, loaded into the system,
and used for a long time without being
modified, we can precalculate

$$\sum_{i=1}^{N} a_i + \sum_{i=1}^{N} b_i$$

and directly apply it to the extra PE.
Thus, in the extra PE, we only need one
adder and one multiplier to compute

$$IS + l \left( \sum_{i=1}^{N} a_i + \sum_{i=1}^{N} b_i \right)$$

Notice that since the linear property 
encoding is mainly used to detect the faults 
in the regular part of the array, the adder 
at the left end of the array (see Figure 4a) 
must be duplicated or specially designed to 
ensure that it functions correctly. 
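
The check performed by the extra PE can be sketched in software as follows. This is a simplified model under stated assumptions: it follows a single coefficient l through the regular PEs, takes the m and n values it meets at each PE as given inputs, uses integers so that the comparison is exact, and models a fault as an additive error on one PE's m' output. The function names and the fault model are hypothetical.

```python
def ced_pass(l, a, b, m_in, n_in, faulty_pe=None, error=1):
    """Follow one coefficient l through PEs 0..N-1.

    a[k], b[k] are the coefficients stored in PE k; m_in[k], n_in[k]
    are the m and n values that l meets there. Returns the (IS, OS)
    pair that enters the extra PE.
    """
    is_sum = os_sum = 0                 # IS and OS enter the first PE as 0
    for k, (ak, bk, m, n) in enumerate(zip(a, b, m_in, n_in)):
        m_out = m + ak * l              # m' = m + a_k l
        n_out = n + bk * l              # n' = n + b_k l
        if k == faulty_pe:
            m_out += error              # hypothetical fault in this PE
        is_sum += m + n                 # IS' = IS + m + n
        os_sum += m_out + n_out         # OS' = OS + m' + n'
    return is_sum, os_sum


def extra_pe_check(l, a, b, is_sum, os_sum):
    """True if OS equals IS + l * (sum of a_i + sum of b_i)."""
    coefficient_sum = sum(a) + sum(b)   # precalculated in the real design
    return os_sum == is_sum + l * coefficient_sum


a, b = [1, 2, 3], [4, 5, 6]
m_in, n_in = [7, 8, 9], [1, 0, 2]
ok = extra_pe_check(2, a, b, *ced_pass(2, a, b, m_in, n_in))
caught = not extra_pe_check(2, a, b,
                            *ced_pass(2, a, b, m_in, n_in, faulty_pe=1))
```

As in the text, the extra PE needs only one multiplication and one addition per coefficient, because the sum of the coefficients does not change between checks.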

A similar scheme can be applied to 
many other one- and two-dimensional sys¬ 
tolic arrays. The key point in designing 
two-dimensional systolic arrays with CED 
is to suitably partition the array by rows or 
by columns and apply a scheme similar to 
that used in the IIR filter to each of the
partitioned rows or columns. 

QR factorization. QR factorization of 
a square matrix A involves finding an
orthonormal matrix Q and an upper
triangular matrix R such that A = QR. QR
factorization is frequently used in such 
signal-processing applications as reducing 
a matrix into triangular, tridiagonal, or 
bidiagonal form; finding its eigenvalues or 
singular values; and determining the least 
squares solution of a given problem. By 
triangularization, many matrix problems 
are reduced to the simpler problem of solv¬ 
ing triangular systems. QR factorization is 
also a key step in computing least squares 
solutions by QR decomposition and the 





computation of eigenvalues by the QR 
algorithm. Givens’ rotation has been pro¬ 
posed for VLSI implementation of QR 
factorization as it requires only local com¬ 
munication and can be easily implemented 
with a square or triangular array of 
processing cells. In this technique, an ele¬ 
ment of a matrix is zeroed out by suitably 
multiplying the matrix by a unitary matrix. 
Consider the operations involved in
zeroing out element (j, i) of the matrix.
The ith and jth rows are updated by the
Givens’ rotation as

$$a'_{i\cdot} = C a_{i\cdot} + S a_{j\cdot} \quad\text{and}\quad a'_{j\cdot} = S a_{i\cdot} - C a_{j\cdot}$$

where $C^2 + S^2 = 1$ and $C = a_{ii}/k$,
$S = a_{ji}/k$, and

$$k = \sqrt{a_{ii}^2 + a_{ji}^2}$$

It can be seen that
$(a'_{i\cdot})^2 + (a'_{j\cdot})^2 = a_{i\cdot}^2 + a_{j\cdot}^2$,
which shows that the sum of squares 
of the elements of a column (hereafter 
called the 2-norm) remains the same. This 





property of QR factorization can be used 
for error checking. 14 Figure 5 shows a tri¬ 
angular array of processors used for per¬ 
forming QR factorization with on-line 
fault detection capability. The rows of the 
matrix to be QR-factored are fed from the 
top in a skewed order, as shown in the fig¬ 
ure. The processors on the diagonal com¬ 
pute the C,S terms for the Givens’ rotation 
and the processors above the diagonal 
update the row elements, as defined by the 
relations shown above. 

The extra row of processors at the top 
and bottom of the array are basically inner 
product processors and are used for com¬ 
puting the 2-norms of the columns of the 
matrix before and after the QR factoriza¬ 
tion. For comparing the 2-norms of each 
column, we need wraparound connections 
for each vertical level of the array. If the 
2-norms of the columns do not match, 
then we know that a fault has occurred. 
Depending on which of the comparators 
gives the error signal, it is possible to iden¬ 
tify in which column of the array the faulty 


Figure 4. A systolic array proposed by S. Y. Kung for an IIR filter (a); the PE used 
in the array in Figure 4a, i.e., in the original (non-CED) design, which computes 
m' = m + a_k l and n' = n + b_k l (b); the PE used in the CED design, which also 
computes IS' = IS + m + n and OS' = OS + m' + n' (c). 

Figure 5. On-line fault detection in QR factorization. 


processor lies. Since a faulty processor can 
affect a number of elements to its right and 
below, only the first error signal from the 
left need be considered. 
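
The sum-of-squares check can be sketched numerically as follows. This is a sequential software model rather than the triangular array: the comparison uses a tolerance because the arithmetic is floating point, and the fault is a hypothetical additive error injected into one element right after one rotation. All names and the tolerance value are assumptions.

```python
import math


def column_sos(a):
    """Sum of squares of each column of a (the text's 2-norm)."""
    return [sum(v * v for v in col) for col in zip(*a)]


def givens_triangularize(a, fault=None):
    """Reduce a square matrix to upper triangular form by Givens rotations.

    fault=(i, j, col, delta) is a hypothetical injected error: delta is
    added to element col of row i right after rows i and j are rotated.
    """
    r = [[float(v) for v in row] for row in a]
    n = len(r)
    for i in range(n):
        for j in range(i + 1, n):
            k = math.hypot(r[i][i], r[j][i])
            if k == 0.0:
                continue                  # element (j, i) is already zero
            c, s = r[i][i] / k, r[j][i] / k
            new_i = [c * x + s * y for x, y in zip(r[i], r[j])]
            new_j = [s * x - c * y for x, y in zip(r[i], r[j])]
            r[i], r[j] = new_i, new_j
            if fault is not None and fault[:2] == (i, j):
                r[i][fault[2]] += fault[3]
    return r


def first_faulty_column(before, after, tol=1e-9):
    """Index of the leftmost column whose sums of squares disagree."""
    for col, (x, y) in enumerate(zip(before, after)):
        if abs(x - y) > tol * max(1.0, abs(x)):
            return col
    return None


a = [[4, 1, 2], [3, 5, 1], [0, 2, 6]]
before = column_sos(a)
clean = first_faulty_column(before, column_sos(givens_triangularize(a)))
flagged = first_faulty_column(
    before, column_sos(givens_triangularize(a, fault=(0, 1, 2, 0.5))))
```

Here `clean` is None and `flagged` is the column in which the error was injected. An error that only reverses the sign of an element leaves the sum of squares unchanged, so this check would not flag it.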

The fault model in this scheme assumes 
that only one processor can be faulty. The 
nondiagonal processors may have either a 
faulty adder or a faulty multiplier. Simi¬ 
larly, the extra processors can have a fault 
in either the multiplier or the adder. Since 
processors on the diagonal need to com¬ 
pute the square root of a number, the pos¬ 
sible faults depend on the way the square 
root is computed. To simplify the imple¬ 
mentation details, we assume that the 
processors on the diagonal can generate 
faulty C,S terms or they can compute the 
outputs incorrectly. It should be noted that 
the above scheme detects most faults in 
single processors except, for example, 
those that cause sign reversal errors; that 
is, if under failure, an output becomes —N 
instead of +N, that error will not be 
detected. 


Singular value decomposition. Singular 
value decomposition is used for various 
applications in signal processing such as 
determining rank, solving linear equations 
with equality constraints, solving 
homogeneous systems of linear equations, 
computing pseudoinverses, determining 
dependencies among the columns or rows 
of a matrix, and so on. A matrix A is 
expressed as $A = U \Sigma V^T$, where U
and V are orthogonal matrices and
$\Sigma$ is the diagonal matrix
$\mathrm{diag}[\sigma_1, \sigma_2, \ldots, \sigma_r, 0, 0, \ldots, 0]$,
and $\sigma_i$, $i = 1, 2, \ldots, r$, are
the singular values and r is the rank of
the matrix. Recently, some systolic array
implementations have been proposed for
singular value decomposition. Singular
values can be obtained by recursively
computing $A_{n+1} = Q_n^T A_n Q_n$,
where $A_n = Q_n R_n$ is the QR
factorization of $A_n$, until the matrix
$A_{n+1}$ converges to the diagonal
matrix $\Sigma$. Since singular value
decomposition can be carried out by
repeated application of QR factorization,
the proposed 


scheme can also be employed here. After 
each QR factorization, the sum-of-squares
(SOS) check on the column 2-norms can
be applied to see if any errors occurred 
because of hardware faults. After obtain¬ 
ing the QR factorization of the matrix A, 
the Q matrix needs to be fed back into the 
array for postmultiplying the R matrix to 
obtain the next iterate of matrix A. To feed 
the Q matrix back into the array, some 
kind of wraparound connections will be 
necessary. As pointed out 
earlier, these wraparound connections can 
be effectively used to achieve fault loca¬ 
tion. Another CED technique for singular 
value decomposition, which uses check¬ 
sums, has been described by Chen and 
Abraham. 15 

Linear least squares minimization. The
problem of linear least squares
minimization involves finding an n-vector
X, given a p × n matrix A and a p-vector
Y, so as to minimize the square of the
norm of (AX - Y). Least squares solutions
are widely used in filtering and in
parameter estimation and identification. 
McWhirter has proposed a scheme to solve 
the least squares problem on a triangular 
systolic array, which is a generalization of 
the array shown in Figure 5. A notable 
difference is that a vector $\beta$ is
used for weighting the rows. The other
difference is that the Y vector is simply
treated as another column of the A matrix
during the QR factorization and input to
the array. The rightmost column of the
array gives the least squares residual at
every stage of the orthogonalization. The
error checking scheme needs to be
slightly modified to recursively compute
$Z = \beta^2 Z + X^2$ rather than
$Z = Z + X^2$. Hence, the computed
Zs in each column before and after the 
orthogonalization should match for a 
fault-free computation. 


Reconfigurable VLSI- 
based systolic arrays 

Design strategies for fault tolerance in 
systolic arrays are highly dependent on the 
application environments. Environments 
in which transient errors and intermittent 
failures are dominant may best be served 
by techniques that employ forward error 
recovery through algorithm-based encod¬ 
ing, as described earlier in this article. 
However, applications in which perma¬ 
nent failures are the dominant concern are 
candidates for a reconfiguration
approach, as described here, in which
failures are tolerated by replacing the 
faulty processors and interconnect with 
fault-free spares or bypassing the faulty 
elements and reducing the size of the com¬ 
putational structure. Design for reconfig¬ 
uration can be utilized to tolerate failures 
of a systolic array at the point of applica¬ 
tion, as well as enhancement of yield 
through toleration of manufacturing 
defects in a chip or wafer. 16 Successful 
reconfiguration relies on the following 
elements: 

• testing or concurrent error detection 
techniques for diagnosis of failures, 

• a design strategy for placement of 
spare computational elements, 

• appropriate interconnect and switch¬ 
ing techniques for incorporation of 
spares, and 

• reconfiguration algorithms, which 
optimally allocate spares in the pres¬ 
ence of multiple failures. 

Because of space limitations, we must 
here concern ourselves with design and 
algorithms for reconfiguration and not 
with diagnosis of failures. 

An understanding of the design objec¬ 
tives appropriate for reconfigurable sys¬ 
tolic architectures provides a basis for 
evaluation of different reconfiguration 
techniques. One objective is efficient utili¬ 
zation of spares. For example, if the archi¬ 
tecture has k spare processors, then an 
objective may be to tolerate any combina¬ 
tion of k processor failures through recon¬ 
figuration. The algorithms that allocate 
spares in the presence of failures should 
also be of reasonable complexity. In addi¬ 
tion, it is often desirable that failures in 
interconnect and switching structures be 
tolerable through reconfiguration. Objec¬ 
tives for a targeted VLSI implementation 
include manageable VLSI layout complex¬ 
ity for large numbers of processors, 
moderate interconnect requirements, and 
a bounded number of pins per chip for 
multiple-chip implementations. Reconfig¬ 
uration should also not result in significant 
performance degradation such as may be 
due to clock skew from long interconnect 
in the presence of failures. All of these 
objectives must be evaluated through anal¬ 
ysis that includes reliability and/or yield 
studies, and performance evaluation. 

A taxonomy of reconfigurable systolic 
structures. Reconfiguration techniques 
appropriate for systolic arrays can be 
described by the 4-tuple RA = (D, S, R, T),
denoting the reconfigurable architecture,
where D is the dimension of the new 


structure after reconfiguration, S 
represents the switching mechanism, R 
denotes replacement strategies for faulty 
elements, and T denotes reconfiguration 
technologies. Each parameter of the 
4-tuple can take on the following values: 

(1) The value of D can be either 1 (fixed 
structure) or 2 (graceful degrada¬ 
tion through a reduction in the 
number of processors). 

(2) The value of S can be 1 (indepen¬ 
dent switches) or 2 (local switches) 
or 3 (bus-structured switching) or 4 
(address renaming). 

(3) The value of R can be 1 (direct or 
local replacement) or 2 (shifted or 
global replacement) or 3 (elimina¬ 
tion of faulty elements without sub¬ 
stitution). 

(4) The value of T can be either 1 (per¬ 
manent, static) or 2 (temporal, 
dynamic). 

For example, RA = (1,2,2,2) denotes 
a reconfigurable architecture that main¬ 
tains its original size after reconfiguration 
and uses local switches that are electrically 
programmable to perform global 
replacement. 
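
The taxonomy maps directly onto a small lookup structure, sketched below. The value labels are taken from the list above; the class and table names are hypothetical.

```python
from typing import NamedTuple

DIMENSION = {1: "fixed structure", 2: "graceful degradation"}
SWITCHING = {1: "independent switches", 2: "local switches",
             3: "bus-structured switching", 4: "address renaming"}
REPLACEMENT = {1: "direct or local replacement",
               2: "shifted or global replacement",
               3: "elimination of faulty elements without substitution"}
TECHNOLOGY = {1: "permanent, static", 2: "temporal, dynamic"}


class RA(NamedTuple):
    """A reconfigurable architecture described by the 4-tuple (D, S, R, T)."""
    d: int
    s: int
    r: int
    t: int

    def describe(self):
        """Return the four labels; raises KeyError for a value outside the taxonomy."""
        return (DIMENSION[self.d], SWITCHING[self.s],
                REPLACEMENT[self.r], TECHNOLOGY[self.t])


example = RA(1, 2, 2, 2).describe()
```

For RA = (1,2,2,2) this yields a fixed structure with local switches, global replacement, and dynamic reconfiguration, which matches the example in the text.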

Structure size (D). Reconfiguration can 
be done either by maintaining the original 
size (D = 1) or by reducing the number of 
computational elements (D = 2) with 
degraded performance. To maintain the 
original size, redundancy has to be 
included in the design so that, once a faulty 
element is located, a spare element is allo¬ 
cated for direct or indirect replacement. 
The advantage of a fixed structure is that 
performance of the network is maintained 
throughout the life of the system. How¬ 
ever, spare elements are usually inactive 
during normal operation, which means the 
utilization of all good PEs is not optimal. 
Both the scheme proposed by Banerjee, 
Kuo, and Fuchs 17 for cube-connected- 
cycles-based networks and that proposed 
by Lowrie and Fuchs 18 for tree architec¬ 
tures maintain a fixed structure through 
switching mechanisms that replace faulty 
elements with spare elements. 

The alternative approach is to assemble 
as many good PEs as possible into the new 
network in the presence of failures. This 
approach typically utilizes redundancy in 
the form of spare interconnect and switch¬ 
ing mechanisms, but not redundant 
processors. The system performance is 
degraded each time faulty elements are 
excluded from the system. This is accom¬ 
plished, for example, in the scheme for 
one- and two-dimensional systolic arrays 


proposed by Kung and Lam, 19 which
achieves graceful performance
degradation by connecting as many good
processors as possible without affecting
the synchronization of data flow. 

Switching techniques ( S ). There are 
three ways to arrange the switches used to 
perform reconfiguration. Reconfiguration 
can also be performed through address 
renaming instead of through switching 
techniques. 

Independent switches (S = 1). Switches 
for performing reconfiguration may be 
separated from the PEs and treated as 
independent elements instead of part of 
the PE. Independent switches typically 
have local memory capable of storing 
several configuration settings, if the setting 
is not permanent. A configuration setting 
enables the switch to establish a direct con¬ 
nection between two or more of its incident 
communication paths. A well-known 
example is the CHiP architecture proposed 
by Snyder. 20 The switches are interspersed 
with PEs to form a processor-switch lat¬ 
tice. An example of a processor-switch lat¬ 
tice is shown in Figure 6. The squares 
represent PEs and the circles represent 
switches. The CHiP approach is unique in 
that both structure and function are recon¬ 
figurable. Reconfiguration for multiple 
functions is accomplished through embed¬ 
ding different networks onto the mesh 
(e.g., binary trees). 

Local switches (S = 2). In a local-switch 
implementation, switches are placed 
immediately around each PE. Local 
switches are usually designed so that infor¬ 
mation entering a faulty PE can be 
directed to one of its neighbors without 
processing. An example of local switches 
is provided by Hassan and Agarwal 21 in 
their scheme for binary tree architectures. 
The local switching scheme requires little 
additional hardware (only a few simple 
switches for each PE). The scheme oper¬ 
ates with local controls without any exter¬ 
nal supervision. The switches can be 
treated as part of each PE so that the lay¬ 
out complexity and degree of connectivity 
of each PE are not changed. The major 
drawback of the scheme is its inflexibility. 

Bus-structured switches (S = 3). This
approach can be best illustrated by the
Diogenes design proposed by Rosenberg. 22
The essence of the Diogenes methodology
is that PEs are in collinear layout,
with “bundles” of communication lines
parallel to the row to which the PEs are
connected. One scans along the row of PEs to
determine which are faulty and which are
fault-free. As a good PE is encountered,
it is connected to the bundles of wires
through a network of switches. This
dynamic setting of the control switches
provides the arrays with their reconfigurability.

Figure 6. Example of a processor-switch lattice.

Address renaming (S = 4). In this 
strategy, switches are not employed to per¬ 
form the reconfiguration. Instead, each 
processor has a modifiable address with 
redundant processors and links provided. 
Once a faulty PE is detected, the addresses of
the processors are rearranged so that the
faulty PE is excluded and a redundant PE
is included. An example is the globally
reconfigurable cube-connected-cycles 
architectures. 17 This scheme employs a 
redundant cycle of PEs and utilizes a 
global reconfiguration technique to toler¬ 
ate any number of failures of PEs within 
a single cycle. The address renaming
approach, although cost-effective because
it needs no switches, is applicable only to
the class of networks not having
nearest-neighbor interconnections (e.g., binary
n-cube, cube-connected-cycles, and
shuffle-exchange networks).


Replacement strategies (R). Replace¬ 
ment strategies affect ease of implementa¬ 
tion, the selection of switching techniques, 
and the complexity of reconfiguration 
algorithms. Three of the dominant 
replacement strategies are direct or local, 
shifted or global, and processor elimi¬ 
nation. 


Direct or local replacement (R = 1). 
Typically, direct or local replacement is 
achieved by partitioning the network into 
subgroups of PEs with provision for spare 
PEs and links within each subgroup of the 
network. Spare PEs within the subgroups 
can replace a faulty PE in such a way that 
all the PEs originally connected to the 
faulty PE are now connected to the spare 
PE. Examples include the modular binary 
tree 21 and locally reconfigurable cube- 
connected-cycles architecture. 17 In the 
first scheme, each modular block consists 
of four PEs connected by 12 links. The 
PEs are interconnected in such a way that 
at any time any one of the four PEs can be 
faulty and the remaining three PEs will 
form the two-level active subtree for the 
complete binary tree. In the second 
scheme, each cycle of the cube-connected- 
cycles network is provided with multiple 


spare PEs and links for tolerating multi¬ 
ple failures and maintaining the structure 
and external connections in each cycle. 
The advantage of local replacement tech¬ 
niques is that the reconfiguration 
algorithms are usually simple, since each 
subgroup is small, with few spare ele¬ 
ments. However, the reliability or yield 
improvement may be limited due to a lack 
of flexibility in replacement options. 

Shifted or global replacement (R = 2). 
If a spare can be used to replace any fail¬ 
ure in the network, then the replacement 
strategy is global in nature. Usually, this 
is performed by shifting; that is, $PE_1$,
which is one of the neighbors of the faulty
$PE_0$, replaces $PE_0$, and $PE_2$, which is one
of the neighbors of $PE_1$, replaces $PE_1$, and
so on until a PE is replaced by a spare PE.
Although reliability and yield may be bet¬ 
ter than a local replacement technique with 
the same amount of redundancy, the 
reconfiguration algorithms often become 
very complex. One simple example is spare 
allocation in reconfigurable rectangular 
arrays through row and column replace¬ 
ment. 23 This technique is applicable to 
two-dimensional rectangular systolic 
arrays provided that mechanisms are 
included to bypass rows and columns of 
faulty PEs. The spare allocation (reconfiguration)
problem can be modeled as a rectangular
array with M x N cells, SR spare
rows, and SC spare columns of cells. A
7 x 9 array with SR = 2 and SC = 3 is
shown in Figure 7. Also shown in Figure
7 is a collection of faulty cells. The
reconfiguration algorithm should select the
minimum number of spare rows and/or
columns that cover all the faulty cells
without using more than the SR spare rows
and SC spare columns available. The
bipartite graph of Figure 8 describes the
previous example, where each link
represents a faulty cell. The optimal spare
allocation problem becomes a bipartite
vertex covering problem constrained by the
available spares; that is, it becomes a
matter of finding a set of Rs and
Cs that cover all the links. Kuo and
Fuchs 23 have shown that this constrained
covering problem and, therefore, optimal
reconfiguration, is NP-complete. Good heuristic
algorithms such as those developed by the
authors for the rectangular array problem
are needed for most global replacement
approaches.
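The covering formulation above can be illustrated with a small exhaustive search. This is only a sketch under stated assumptions: the function name `allocate_spares` and the fault pattern are hypothetical (the pattern is not the one in Figure 7), and the search is exponential in the number of faulty rows, so it is practical only for tiny arrays. It is not the heuristic of Kuo and Fuchs.

```python
from itertools import combinations

def allocate_spares(faults, spare_rows, spare_cols):
    """Return (rows, cols) to replace, or None if the array cannot be repaired.

    faults is a list of (row, col) positions of faulty cells.
    """
    fault_rows = sorted({r for r, _ in faults})
    best = None
    for n_rows in range(min(spare_rows, len(fault_rows)) + 1):
        for rows in combinations(fault_rows, n_rows):
            # Faults not covered by a replaced row must be covered by columns.
            cols = {c for r, c in faults if r not in rows}
            if len(cols) > spare_cols:
                continue
            if best is None or n_rows + len(cols) < len(best[0]) + len(best[1]):
                best = (set(rows), cols)
    return best

# Hypothetical fault pattern on a 7 x 9 array with SR = 2 and SC = 3.
faults = [(0, 1), (0, 4), (0, 7), (2, 3), (5, 3), (6, 3), (4, 8)]
plan = allocate_spares(faults, spare_rows=2, spare_cols=3)
```

Each chosen row or column corresponds to a vertex of the bipartite graph, and each fault to a link that must touch at least one chosen vertex.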

Elimination of faulty elements (R = 3). 
Graceful degradation through a reduction 
in the number of processors, that is, D = 
2, is implemented by simple elimination. 
In order to maintain the same architecture, 
sometimes fault-free elements also have to
be eliminated. Kung and Lam 19 proposed 
a systolic fault-tolerant scheme that main¬ 
tains the original data flow pattern by 
bypassing defective cells with delay 
registers. As a result, many of the desira¬ 
ble properties of systolic arrays (such as 
local and regular communication between 
cells) are preserved. They show that the 
problem can be reduced to the problem of 
incorporating extra delays on certain paths 
in originally correct systolic designs. For 
unidirectional linear arrays, this scheme 
has the distinct advantage that all fault- 
free cells can be utilized. This scheme also 
differs from other approaches in that it 
considers the effect of reduced throughput 
due to a slower system clock required by 
the fact that the communication between 
logically adjacent cells can now span an 
arbitrarily large distance. An example sys¬ 
tolic fault-tolerant unidirectional linear 
array is shown in Figure 9. 
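As a rough illustration of the elimination strategy for a unidirectional linear array, the sketch below forms the logical array from the fault-free cells and reports the longest physical span between logically adjacent cells. The function name and the fault pattern are hypothetical, and treating the longest span as the quantity of interest is a simplifying assumption; the text states only that such spans can grow arbitrarily large and may force a slower clock.

```python
def bypass_faulty_cells(cell_ok):
    """Return the indices of fault-free cells and the longest bypass span.

    cell_ok[i] is True when physical cell i is fault-free.
    """
    good = [i for i, ok in enumerate(cell_ok) if ok]
    # Physical distance between each pair of logically adjacent good cells.
    spans = [b - a for a, b in zip(good, good[1:])]
    return good, (max(spans) if spans else 0)

# Hypothetical array of seven cells, three of them faulty.
cells = [True, False, True, True, False, False, True]
logical_array, longest_span = bypass_faulty_cells(cells)
```

All four fault-free cells are used, which is the advantage the scheme has for unidirectional linear arrays.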

Technology (T). Switches for restructuring
the network in the presence of multiple
failures can be either permanent
(static), T = 1, or temporal (dynamic), T
= 2. Currently available physical
reconfiguration techniques include mask
programming, laser cutting or connection,
fuse blowing, and programmable logic.
All but the last provide static restructuring 
and thus are appropriate for enhancing 
yield, but they usually do not support fault 
tolerance after the system is in operation. 
Programmable switches may be used to 
support both yield enhancement and fault 
tolerance. However, they typically require 
extra circuitry such as latches as well as 
additional control lines, and they may be 
volatile—that is, a loss of power may 
require a reprogramming of all the connec¬ 
tions. Static switches using laser technol¬ 
ogy have been widely used by 
semiconductor memory manufacturers 
for yield enhancement. The laser repair 
system invokes a spare by first disconnect¬ 
ing the faulty bit or word line and then 
selectively programming decoder links 
associated with the spare. 

Highly parallel systolic arrays are
especially amenable to low-cost
fault tolerance schemes that are
tailored to the algorithm and the array
architecture. This article has described a
variety of encoding schemes and architectures
that can be used to achieve fault
tolerance in matrix operations and other
signal-processing computations.
Reconfiguration techniques applicable to systolic
arrays have also been described. These
techniques will allow the insertion of fault
tolerance with low overhead into the next
generation of systolic array systems. □

Figure 7. A 7 x 9 array with two spare rows and three spare columns.

Figure 8. Bipartite description of the reconfiguration problem.

Figure 9. Systolic fault-tolerant scheme for unidirectional linear arrays.

Many scientific and technical
applications require high 
computing speed; those 
involving matrix computations are typical. 
For applications involving matrix 
computations, algorithmically specialized, 
high-performance, low-cost architectures 
have been conceived and implemented. 
Systolic array processors (SAPs) are a 
good example of these machines. 1,2

An SAP is a regular array of simple 
processing elements (PEs) that have a 
nearest-neighbor interconnection pattern. 
The simplicity, modularity, and 
expandability of SAPs make them suitable 
for VLSI/WSI implementation. 

Algorithms that are efficiently executed 
on SAPs are called systolic algorithms 
(SAs). An SA uses an array of systolic cells 
whose parallel operations must be 
specified. When an SA is executed on an 
SAP, the specified computations of each 
cell are carried out by a PE of the SAP. An 
SA is a specification of 

• the type of operation performed by 
each cell during each step, 

• the communication pattern among 
the cells, and 
• the data and control sequences for
I/O in the boundary cells. 

In an SA, communication between cells is 
regular and local, massive parallelism is 
exhibited, and a relatively low number of 
I/O operations is required. The properties 
of systolic algorithms and architectures 
limit their practical use to that of solving 
only regular, compute-bound problems 
with a high degree of parallelism. 2
Fortunately, many matrix problems are of 
this type. 

A large set of SAs and SA design 
methodologies 3 has recently been 
published. Most of these SAs require a 
number of cells that depends on some
dimension of the problem that is to be 
solved. We refer to these algorithms as 
problem-size-dependent SAs. Normally, 
these SAs solve the problem with just one 
pass of the data through the array. More¬ 
over, the array topology is dependent on 
the type of problem to be solved. 

The number of PEs in a real SAP is
fixed. It depends on factors such as
technology, desired speed, available host-SAP
communication bandwidth, and system
cost and complexity. Usually, the problem
is too large to be solved directly
on the array. In addition,
if the system has fault-tolerant
capabilities, the array is reconfigured in
cases of PE failure, and the algorithm
must be restructured to make it executable
by the smaller array that results from such
failure. The original problem
must therefore be partitioned into subproblems
whose individual sizes fit the available SAP
dimensions, and general partitioning techniques
suitable for SAPs and problems of any
size are needed.

Figure 1. (a) A problem-size-dependent systolic algorithm for a lower triangular
system of equations. (b) The SAP communication pattern that results from
coalescent mapping. (c) The SAP communication pattern that results from
cut-and-pile mapping. (d) Operations performed by PEs.

Let us review here some previous work
related to systolic partitioning. L.
Johnsson 4 proposes partitioning
techniques to aid in performing Gaussian
elimination and matrix multiplication. R.
Schreiber and P.J. Kuekes 5 present a QR
factorization that is executed on a
modification of the triangular array proposed
by W.M. Gentleman and H.T. Kung. D.
Heller 6 describes partitioning methods
that can be used in the QR factorization of
band matrices. Using a trapezoidal
extension of the SAP proposed by W.M.
Gentleman and H.T. Kung, H.Y.H.
Chuang and G. He 7 obtain a versatile
SAP oriented to matrix computations that
implements the Faddeeva Algorithm. H.D.
Cheng and K.S. Fu 8 propose a
partitioning method and a computational
model that are based on the space-time
domain expansion approach. And
recently, D.I. Moldovan and J.A.B.
Fortes 9 have extended their SAP design
methodology to include fixed-size SAPs.


Usually, SAPs are algorithm-specific, 
and some of them are even direct hardware 
implementations of problem-size- 
dependent SAs. A typical application area 
for algorithm-specific problem-size- 
dependent SAPs is real-time digital signal 
processing. 

To achieve wider applicability, it is 
useful to design versatile SAPs, which are 
SAPs that, when attached to a host, can 
execute several algorithms in a problem- 
size-independent manner. 

To derive a versatile SAP, we propose 
the following steps. 

For each problem in the problem set: 

(1) find one or several SAs; 

(2) select a problem-size-dependent SA 
with maximum similarity in cell 
operation and interconnection 
topology to the other SAs selected 
for problems; 

(3) find partitioning algorithms for 
obtaining subproblems, and modify 
the selected SA so that the 
subproblems can be executed on a 
problem-size-independent SAP. 

For the whole problem set: 

(4) integrate the requirements (PE 
operations, interconnections, and 
control) for all the selected 
algorithms so that the versatile SAP 
can be built; and 

(5) improve the versatile SAP with 
respect to speed (for example, by 
making use of pipelined PEs), array 
utilization, and fault-tolerant 
capabilities. 

In the process described above, the 
selection of the SA and of the partitioning 
algorithms is the key to achieving the 
following goals: 

• Maximization of array utilization. We
define utilization as $U = T_1 / (n T_n)$, where
n is the number of PEs in the SAP, $T_1$ is
the number of cycles needed to solve the
problem with just one PE, and $T_n$ is the
number of cycles required to solve it on the
SAP. Once n is fixed, maximizing U is
equivalent to minimizing problem-execution
time in the available SAP.

• Minimization of the complexity of the 
SAP that results from the derivation pro¬ 
cess. Important factors that determine 
complexity are the types of operations 
performed by the PEs and by the different 
classes of required PEs, the bandwidth 
between the SAP and the host, the amount 
of memory needed, the interconnection 
pattern, and the overall control of the 
array. 
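The utilization measure defined in the first goal can be written as a one-line function. This is a minimal sketch; the cycle counts in the example are hypothetical and are chosen only to show that, for fixed n, a shorter execution time on the SAP gives a higher utilization.

```python
def utilization(t1, n, tn):
    """Array utilization U = T_1 / (n * T_n).

    t1: cycles needed with a single PE
    n:  number of PEs in the SAP
    tn: cycles needed on the n-PE SAP
    """
    return t1 / (n * tn)

# Hypothetical cycle counts for a 3-PE array.
u_slow = utilization(t1=36, n=3, tn=24)
u_fast = utilization(t1=36, n=3, tn=18)
```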

In this article, we present a technique for
obtaining the partitioning and the
transformation of matrix problems; the
technique is designed to minimize execution
times for big problems in small arrays and
to introduce very little additional
complexity into the system. In the article,
we apply the technique to solving both
matrix-by-vector multiplication (M*V) 10
and triangular system of equations (TSE)
problems by means of a one-dimensional
(1D) SAP; we also apply it to solving
matrix-by-matrix multiplication (M*M),
triangular matrix equations (TME), LU
decomposition (LU), and other problems
by means of a 2D SAP. 11 Actually, the
proposed technique allows one to solve
any problem on the preceding list on
either a 1D 12 or a 2D array.

We select, in Step 2 of the previously 
described process, problem-size- 
dependent SAs proposed by H.T. Kung 
and C.E. Leiserson. 1 We do not explicitly 
consider Step 4; however, implementing it 
is simple because we make appropriate 
choices in Step 2. 

Finally, we make some comments 
related to the use of pipelined PEs. 


Systolic algorithms for 
matrix computations 

Several problem-size-dependent SAs 
can be derived for a matrix problem, 
depending on the speed and direction of 
dataflows as well as on the types of matrix 
substructures (rows, columns, or 
diagonals) that form the data sequence 
that enters or leaves through each I/O link 
of the array. We discuss two types of SAs 
according to their I/O matrix substruc¬ 
tures: band SAs, in which the matrices 
enter or leave by diagonals, and dense SAs, 
in which the I/O substructures are rows or 
columns. We do not consider here hybrid 
algorithms (one type is the class of 
algorithms in which one matrix enters by 
rows or columns and another by 
diagonals). 

In problem-size-dependent band SAs, 
the number of cells is related to the matrix 
bandwidth. When these algorithms are 
executed on an SAP, maximum array utili¬ 
zation is achieved when problems with 
band matrices ( band problems) are being 
solved. In problem-size-dependent dense 
SAs, the number of cells depends on the 
number of rows or columns in the matrices 
involved. Maximum array utilization is 
achieved in the case of dense problems 
(that is, problems with dense matrices). 



When the structure of the matrices 
(band or dense) involved in the problem 
does not fit the type of SA (dense or band) 
that is being used to solve the problem, 
array utilization decreases dramatically. 
Nevertheless, this drawback can be over¬ 
come if the original algorithm is modified. 
For example, Partial Row Translations 13 
have been proposed for the solution of a 
dense matrix-by-vector multiplication 
problem executing a problem-size- 
dependent band SA. On the other hand, 
Systolic Rings 14 are well suited for the 
efficient solution of band problems 
executing problem-size-dependent dense 
SAs. 

Another important factor in SAs is the
relative direction of dataflows. Accordingly,
we differentiate between SAs with
“data contraflow” and SAs without it.
Figures 1a and 2a show, respectively, one
1D topology and one 2D topology, both
with data contraflow. For example, by
changing the direction of the diagonal
connections in Figure 2a, we would obtain a
2D topology without data contraflow.

Generally, in 1D (or 2D) SAs with data
contraflow, only one of every two (or
three) consecutive cells is active during
each step. In these cases, to achieve
maximum array utilization, every PE of the
SAP would have to execute the computations
of two (or three) consecutive cells.

In the set of problems considered in this
article, we distinguish between two
groups. The first includes M*V and M*M
multiplication. In the second group we
have all other problems (TSE, TME, LU).
Problems in the first group are homogeneous;
that is, all the operations to be performed
on data are of the same type. This
fact makes it possible to have only one type
of cell, which is used to construct a
homogeneous array. In the second group,
the problems are nonhomogeneous. For
example, the operations (division and
change of sign) that are performed on the
elements in the main diagonal are different
from those operations (multiplication
and addition) that are performed on the
other elements. For this reason, the array
may need different types of cells. Also, in
problems in this group, there is dependency
between results.

Figure 2. (a) A 2D array of cells. (b) The spiral SAP communication pattern that
results from cut-and-pile mapping. (c) Operations performed by PEs.

For the problems considered in this arti¬ 
cle, we identify certain characteristics for 
band SAs. 

• All data move by traversing the array, 
and each cell performs the same operation 
during all the cycles of algorithm execu¬ 
tion; these facts eliminate PE-control 
requirements. 

• There are several band SAs with and 
without data contraflow for solving prob¬ 
lems of the first group. 

• For the second group, because of data 
dependencies, the band SAs require data 
contraflow; in these cases, all the cells per¬ 
form multiplication and addition except 
one boundary cell that must perform divi¬ 
sion and change of sign. 

We can also identify certain characteris¬ 
tics for dense SAs. 

• For both groups of problems, one of 
the data sets (either the operands or the 
partial results) remains static within the 
cells as the computation proceeds. Conse¬ 
quently, additional control is necessary to 
establish the difference between array load 
(and/or array unload) operations and cal¬ 
culation operations. 

• Data contraflow is not required for 
either group of problems. 

• For problems in the second group, the 
dense SAs require that either all the cells 
(in the case of a ID array) or all the cells 
in the main diagonal (in the case of a 2D 
array) must be able to perform division 
and change of sign, besides addition and 
multiplication. Consequently, the com¬ 
plexity of the array increases. 

In summary, dense SAs require greater 
array complexity than do band SAs, but 
no data contraflow. The absence of data 
contraflow usually produces conditions 
favorable to fault tolerance and to imple¬ 
mentation with pipelined PEs. 14 Efficient 
partitioning for dense SAs leads to the 
usual decomposition into square or rectan¬ 
gular submatrices. 

The implementation of band SAs 
requires lower complexity in the array and 
less control than does the implementation 
of dense SAs. In this article, we use band 
SAs and propose both partitioning tech¬ 
niques that incorporate triangular subma¬ 
trices and the utilization of pipelined PEs. 

Partitioning and DBT 
transformation 

A problem-size-dependent SA defines a 
good space-time mapping between the set 


of computations in the problem and the set 
of cells. If the number of PEs in the array 
is smaller than the number of cells in the 
algorithm, it is necessary to carry out an 
additional space-time mapping. In this sec¬ 
tion we present the SAPs obtained by spa¬ 
tial mapping that are used to execute 
problem-size-independent SAs. We also 
present the partitioning and DBT (Dense 
to Band matrix transformation by 
Triangular-block partitioning) algorithms, 
which serve to transform a problem-size- 
dependent SA—by temporal mapping— 
into a problem-size-independent SA. 

Spatial mapping. The spatial cell-to-PEs 
mapping provides the set of computations 
to be performed by each PE; that is, it 
defines the different types of operations to 
be performed by each element, the array 
interconnection pattern, and the location 
of memory units. A good spatial mapping 
must preserve the topological properties of 
the SA. To illustrate this, let us suppose 
that we want to solve a triangular system 
of equations: LX = B, where L is a lower 
triangular c-by-c matrix and X and B are 
column vectors. A one-pass solution can
be achieved with the band SA 1 shown in
Figure 1a, if we assume that c = 6. All the
cells perform multiplication and addition
except $\mathrm{cell}_1$, which performs division and
change of sign.

Let us consider two types of cell-to-PE 
mappings: coalescent and cut-and-pile, 
both of which preserve regularity and 
locality in communications. We assume 
that the available SAP has w PEs. 

In the coalescing mapping, $\mathrm{cell}_i$, for
$1 \le i \le c$, is mapped to $\mathrm{PE}_k$, where
$k = \lceil i / \lceil c/w \rceil \rceil$ and $1 \le k \le w$.
(See Figure 1b for an illustration of the
communication pattern when w = 3.) In coalescing
mapping, $\lceil c/w \rceil$ consecutive cells are
assigned to one PE, so the PE requires feedback
links to itself; these links are implemented with
local memory. Each data item that has
entered into a PE remains in it for $\lceil c/w \rceil$
cycles. Hence, each PE must have a local
memory size proportional to $\lfloor c/w \rfloor$. In
addition, the computing load is not uniformly
distributed among PEs. Because of
this, PEs 1, 2, and 3 in Figure 1b have to
carry out 11, 7, and 3 computations,
respectively.

In the cut-and-pile mapping, $\mathrm{cell}_i$ is
mapped to $\mathrm{PE}_k$, where $k = 1 + (i - 1) \bmod w$
(Figure 1c). Each PE is functionally
equivalent to a cell in the SA. One feedback
line between the first and the last PEs
is required. In general, the size of the memory
needed in the feedback loop is proportional
to $\lfloor c/w \rfloor$. With our technique,
attaining sizes proportional to w alone is
possible. This mapping distributes the
computing load more evenly among the
PEs, and therefore the computing time is
smaller. The loads of PEs 1, 2, and 3 in
Figure 1c are 9, 7, and 5 computations,
respectively. We chose cut-and-pile mapping
for the partitioning methodology we
propose.
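The two cell-to-PE mappings and the resulting PE loads can be checked with a short sketch. One assumption is made here that the text does not state explicitly: cell i of the band SA for a c-by-c lower triangular system is taken to perform c - i + 1 computations (one per element of the diagonal it handles). Under that assumption the sketch reproduces the loads quoted for Figures 1b and 1c. The function names are illustrative only.

```python
def coalescing_pe(i, c, w):
    """PE index (1-based) for cell i: k = ceil(i / ceil(c / w))."""
    cells_per_pe = -(-c // w)          # integer ceiling of c / w
    return -(-i // cells_per_pe)       # integer ceiling of i / cells_per_pe

def cut_and_pile_pe(i, w):
    """PE index (1-based) for cell i: k = 1 + (i - 1) mod w."""
    return 1 + (i - 1) % w

def pe_loads(c, w, mapping):
    """Total computations per PE, assuming cell i does c - i + 1 of them."""
    loads = [0] * w
    for i in range(1, c + 1):
        loads[mapping(i) - 1] += c - i + 1
    return loads

c, w = 6, 3
coalescing_loads = pe_loads(c, w, lambda i: coalescing_pe(i, c, w))
cut_and_pile_loads = pe_loads(c, w, lambda i: cut_and_pile_pe(i, w))
```

The busiest PE sets the computing time, so the smaller maximum load under cut-and-pile mapping (9 rather than 11) is what makes it the faster choice in this example.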

When we consider using SAs with 2D 
topology to solve problems such as LU 
decomposition, 1 we see that the spatial 
cut-and-pile mapping is similar to the map¬ 
ping just described. In Figure 2, an array 
with 4-by-4 cells and the SAP with 2-by-2 
PEs that results from the mapping are 
shown. The feedback lines that can be seen 
in the figure must be added to satisfy the 
SA communication requirements. This 
bidimensional array is named spiral 
SAP. 15 The feedback lines in the horizon¬ 
tal and vertical flows, which would form 
a torus, are not required because the infor¬ 
mation that flows in these directions is not 
modified in the problems under discussion.
For the same reason, the feedback
line from the last to the first PE in Figure
1c has been eliminated. It is also possible
to find mappings from the 2D topology to
a 1D array. 12 Figures 1c and 2b illustrate
the 1D and 2D SAP topologies considered
in this article, and in Figures 1d and 2c, the
set of all the necessary operations to be
performed by PEs is shown. In our 
descriptions of each algorithm, we will 
indicate the particular set of operations to 
be carried out by each PE. 

Temporal mapping. The original prob¬ 
lem must be partitioned into subproblems 
whose data structures fit into the available 
SAP dimensions. The subproblem data 
structures are conditioned by the spatial 
mapping. These subproblems must be 
executed one after another in a manner 
that respects the data dependencies. The 
temporal mapping defines the time at 
which each computation assigned to a PE 
must be performed. We define this tem¬ 
poral mapping by means of the input data 
sequence (that is, by the order in which 
subproblems are executed). 

The temporal mapping directly
influences the array utilization. A lack of
global communication capability combined
with a high degree of pipelining in
the SAP may produce low array utilization
in loading and unloading of subproblems.
Maximum utilization may be achieved if
the matrix structures of subproblems and
the execution order of subproblems allow
each subproblem’s unloading to overlap
with the next subproblem’s loading.

Figure 3. (a) Triangular-block partitioning of a matrix-by-vector problem. (b) A
problem transformed by DBT algorithm application.

The solution of each subproblem is
arrived at by execution of a band SA. The
execution sequence for solving all the
subproblems may be viewed as the execution
of a new band problem, which we denote
as the transformed problem. The bandwidth
of the transformed problem fits the
available SAP size.

Overlapping the loads and unloads of 
subproblems is equivalent to achieving 
maximum juxtaposition of the subma¬ 
trices that constitute the band of the trans¬ 
formed problem (that is, it is equivalent to 
getting maximum density in the band). 

Thus, the total execution time is 
minimized. 

A set of rules for constructing the transformed
problem’s band with maximum
density must be defined for each type of
problem. The DBT proposed here achieves
the transformation of a homogeneous
problem, such as a matrix-by-vector or a
matrix-by-matrix multiplication, into a
band problem of the same type (such a
band problem is called a homogeneous
transformed problem). The following
rules transform the original N-by-M
matrix $A$ into a band matrix $\bar{A}$ with
bandwidth w. N, M, and w can have any value
(usually N, M >> w). However, we
assume (and the assumption does not
result in loss of generality) that $N = \bar{N} w$
and $M = \bar{M} w$. If A does not have these
dimensions, it is augmented with rows
and/or columns of zeros until it does.

Rules for triangular-block partitioning. 

(1) Split matrix $A(N, M)$ into $\bar{N}$-by-$\bar{M}$
square submatrices $A_{ij}(w, w)$.

(2) Decompose each submatrix $A_{ij}$,
in turn, into three submatrices:
$A_{ij} = A^{L}_{ij} + A^{D}_{ij} + A^{U}_{ij}$, where $A^{L}_{ij}$ is the
strictly lower triangular part of $A_{ij}$; $A^{U}_{ij}$
is the strictly upper triangular part of $A_{ij}$;
and $A^{D}_{ij}$ is formed with the main
diagonal of $A_{ij}$. From these matrices we
can define $A^{LD}_{ij} = A^{L}_{ij} + A^{D}_{ij}$ and
$A^{DU}_{ij} = A^{D}_{ij} + A^{U}_{ij}$, which are, respectively,
the lower and upper triangular submatrices
of $A_{ij}$.
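As a small worked illustration of rule (2), consider a single block with w = 3. The entries $a_{rc}$ are generic placeholders introduced here for illustration, and the block indices are dropped for brevity.

```latex
\[
A =
\begin{pmatrix}
a_{11} & a_{12} & a_{13}\\
a_{21} & a_{22} & a_{23}\\
a_{31} & a_{32} & a_{33}
\end{pmatrix},
\qquad
A^{L} =
\begin{pmatrix}
0 & 0 & 0\\
a_{21} & 0 & 0\\
a_{31} & a_{32} & 0
\end{pmatrix},
\qquad
A^{DU} = A^{D} + A^{U} =
\begin{pmatrix}
a_{11} & a_{12} & a_{13}\\
0 & a_{22} & a_{23}\\
0 & 0 & a_{33}
\end{pmatrix},
\]
\[
A = A^{L} + A^{DU}.
\]
```

The alternative split $A = A^{LD} + A^{U}$ moves the main diagonal into the lower part instead.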

Rules for dense matrix to band matrix 
transformation in inner-product-based 
problems. 

(3) Build band matrix $\bar{A}$ by alternately
juxtaposing $A^{LD}_{ij}$ and $A^{U}_{ij}$ submatrices,
or by juxtaposing $A^{L}_{ij}$ and $A^{DU}_{ij}$,
to fill up the band of $\bar{A}$. Depending on
which submatrix is chosen to be first,
matrix $\bar{A}$ may be a lower- or an
upper-band matrix. The following refers to the
case $A_{ij} = A^{L}_{ij} + A^{DU}_{ij}$, where $1 \le i \le \bar{N}$
and $1 \le j \le \bar{M}$ and $\bar{A}$ is an upper-band
matrix. The other cases can be easily
derived. The rules are

(a) For $1 \le k \le \bar{N}\bar{M}$, if $\bar{A}_{k,k}$ is equal
to $A^{DU}_{i,j}$, then $\bar{A}_{k,k+1}$ must be
equal to $A^{L}_{i,m}$ for any m such that
$1 \le m \le \bar{M}$.

(b) For $1 \le k \le \bar{N}\bar{M} - 1$, if $\bar{A}_{k,k+1}$ is
equal to $A^{L}_{i,j}$, then $\bar{A}_{k+1,k+1}$ must be
equal to $A^{DU}_{n,j}$ for any n such that
$1 \le n \le \bar{N}$.

Several $\bar{A}$ matrices can be obtained by
the application of the preceding rules.
However, as we shall note later, in the
selection of a specific $\bar{A}$, implementation
factors must be taken into account.

When the original matrix A is dense, the
transformed matrix $\bar{A}$, with maximum
band density and containing all the submatrices
of A, can always be obtained. The
dimensions of $\bar{A}$ are $\bar{N}\bar{M}w$-by-$(\bar{N}\bar{M} + 1)w$
if $\bar{A}$ is an upper-band matrix,
or $(\bar{N}\bar{M} + 1)w$-by-$\bar{N}\bar{M}w$ if $\bar{A}$ is a
lower-band matrix. When the DBT algorithm
is applied to nondense matrices (for
example, to band matrices), the dimensions
of $\bar{A}$ and the density of its band
depend on the way the sparsity in A is
structured.

In a subsequent section entitled
“Partitioning and execution technique,” we
present the partitioning and transformation
algorithm used to transform the TSE,
TME, and LU problems.

The matrix-by-vector 
problem 

Here we address the problem of computing the vector Y = AX + B, where
A is an N-by-M matrix, and X and B are vectors whose respective
dimensions are M and N. We assume that our 1D SAP has w PEs.
$\bar{Y} = \bar{A}\bar{X} + \bar{B}$ is the transformed problem. In
Figure 3a, the partitioning of the original problem is shown for
N = M = 2. Each square block is decomposed as
$A_{ij} = A^{L}_{ij} + A^{U}_{ij}$. One of the possible DBT
transformations that builds up matrix $\bar{A}$ by following the
preceding rules is shown in Figure 3b. Transformed vectors $\bar{X}$
and $\bar{B}$ are determined by the selected DBT because the
transformed solution
$\bar{Y} = (\bar{Y}_1, \bar{Y}_2, \bar{Y}_3, \bar{Y}_4)$ must include
the original solution $Y = (Y_1, Y_2)$. As we can see in Figure 3b,
the transformed vectors are built from the subvectors of w elements
into which the original X and B vectors have been split. Note that
$\bar{B}$ and $\bar{Y}$ have some subvectors that are the same. By
means of these recurrences, it is possible to make subvectors
$\bar{Y}_2$ and $\bar{Y}_4$ of $\bar{Y}$ equal to subvectors $Y_1$ and
$Y_2$ of Y, respectively.

Figure 4. I/O data sequencing for the solution of the matrix-by-vector
transformed problem.
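The transformation described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `dbt_by_rows` and `matvec_via_dbt` are hypothetical, the arrangement of triangles is one row-wise choice that satisfies the rules (it is not claimed to be the exact layout of Figure 3b), and the matrix dimensions are assumed to be multiples of w. Each block-row of the band holds the upper triangle (with diagonal) of one block and the strictly lower triangle of the next block in the same block-row, so every band row has exactly w entries.

```python
def dbt_by_rows(A, w):
    """Return (band, N, M): a block upper-band matrix built from dense A.

    A has N*w rows and M*w columns. Block-row k of the band holds the upper
    triangle (with diagonal) of block (i, k mod M) on the block diagonal and
    the strictly lower triangle of block (i, (k + 1) mod M) next to it,
    where i = k // M. This is one assumed row-wise arrangement.
    """
    N, M = len(A) // w, len(A[0]) // w
    K = N * M
    band = [[0] * ((K + 1) * w) for _ in range(K * w)]
    for k in range(K):
        i, j_up, j_lo = k // M, k % M, (k + 1) % M
        for r in range(w):
            for c in range(w):
                if c >= r:   # upper triangle, diagonal included
                    band[k * w + r][k * w + c] = A[i * w + r][j_up * w + c]
                else:        # strictly lower triangle
                    band[k * w + r][(k + 1) * w + c] = A[i * w + r][j_lo * w + c]
    return band, N, M


def matvec_via_dbt(A, X, B, w):
    """Compute Y = A X + B through the band problem, chaining partial results."""
    band, N, M = dbt_by_rows(A, w)
    K = N * M
    # Transformed X: subvector k is the original subvector X_(k mod M).
    x_bar = []
    for k in range(K + 1):
        j = k % M
        x_bar.extend(X[j * w:(j + 1) * w])
    Y = []
    for k in range(K):
        i = k // M
        if k % M == 0:
            acc = list(B[i * w:(i + 1) * w])   # a fresh subvector of B enters
        # Otherwise the previous partial result is fed back as the B input.
        acc = [acc[r] + sum(a * x for a, x in zip(band[k * w + r], x_bar))
               for r in range(w)]
        if k % M == M - 1:
            Y.extend(acc)                      # a subvector of Y is complete
    return Y
```

With N = M = 2 blocks, the second and fourth transformed result subvectors are the ones collected as the original result, which matches the recurrence described in the text.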

The I/O data sequences for the execution of the transformed problem
for N = M = 6 and w = 3 are shown in Figure 4. Each PE performs
Operation A (Figure 1d); this operation is equivalent to the one
carried out in all the cells of the original SA [1]. Note that the
recurrences between $\bar{B}$ and $\bar{Y}$ are implemented by means of
the feedback link between $\mathrm{PE}_1$ and $\mathrm{PE}_w$. All the
elements of $\bar{Y}$ that must be introduced as elements of $\bar{B}$
through the NE input of $\mathrm{PE}_w$ are available at the NW output
of $\mathrm{PE}_1$ w cycles before they are needed. This fixed delay
can be achieved by the insertion of w equally spaced registers in the
feedback path; such insertion preserves local communication in the
array. The feedback selection node (labeled “FSN” in Figure 4)
controls the NE data input of $\mathrm{PE}_w$ so that the right
computations can be made. In the FSN, only a multiplexer is needed to
select either partial results or the elements of vector B. The
multiplexer’s “Select” signal is easily generated; it is defined by
the DBT transformation used. We can observe that the i,j index
sequences of the matrix A elements that enter through ports N of the
SAP obey a regular pattern. Figure 4 shows the simple address
generation required for the execution of the transformed problem.

Because the band of matrix $\bar{A}$ is dense, it is only during
2w - 2 load cycles and w - 1 unload cycles that array utilization is
below the maximum of 1/2 for the SAP. The computation time is given by
T = (2NM/w) + 2w - 3. Nevertheless, to reach a utilization value near
unity, we should group into a physical PE the computations assigned to
two consecutive PEs in the theoretical contraflow array. If we do this
and if the number of available physical PEs is $w_f$, w must be equal
to $2w_f$ for the DBT transformation, and the total computing time
will be $T \approx NM/w_f$ with physical array utilization of
$U \approx 1$. It is also possible to execute the transformed problem
$\bar{Y} = \bar{A}\bar{X} + \bar{B}$ directly on a 1D SAP that has w
PEs and is without data contraflow. In such a case, the computing time
is T = (NM/w) + 2w - 2, and utilization approaches 1 without PE
grouping.
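The two timing expressions can be compared numerically. The sketch below is illustrative only: the function names are hypothetical, and utilization is assumed to mean the NM useful multiply-add operations divided by the number of PE-cycles (w PEs times T cycles), which is one common reading rather than a definition given in the text.

```python
def contraflow_time(N, M, w):
    """Cycles on a 1D SAP with data contraflow: T = 2NM/w + 2w - 3."""
    return 2 * N * M / w + 2 * w - 3


def no_contraflow_time(N, M, w):
    """Cycles on a 1D SAP without data contraflow: T = NM/w + 2w - 2."""
    return N * M / w + 2 * w - 2


def utilization(N, M, w, cycles):
    """Assumed definition: useful multiply-adds over available PE-cycles."""
    return (N * M) / (w * cycles)


if __name__ == "__main__":
    for n in (6, 60, 600):
        t_c = contraflow_time(n, n, 3)
        t_n = no_contraflow_time(n, n, 3)
        print(n, t_c, utilization(n, n, 3, t_c), t_n, utilization(n, n, 3, t_n))
```

As the problem grows relative to w, the first utilization figure tends toward one half and the second toward one, which is the behavior the paragraph describes.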


Regular DBTs 

All DBT transformations produce matrices with a dense band of minimum
length. The transformations allow maximum array utilization and
minimum computation time. However, the complexity of both the
input-data-address generation and the feedback selection node depends
on the DBT type used. In general, extra memory is required in the
feedback selection node to store partial results during some cycles.
The size of this memory and the input-data-address generation depend
on the selected DBT algorithm. We call DBT algorithms “regular” when
they permit global designs that have minimum complexity. Some regular
DBTs that we will discuss below are shown in Figure 5. Regular DBT
algorithms are classified into two groups: standard and transposed. In
the standard group (Figure 5a), the NM square blocks from A(N,M) have
been decomposed as $A_{ij} = A^{L}_{ij} + A^{U}_{ij}$, for
$1 \le i \le N$ and $1 \le j \le M$. The transformed matrix $\bar{A}$
is an NM-by-(NM + 1) block upper-band matrix. Figure 5a illustrates
the order to follow in the selection of triangular blocks when one is
building matrix $\bar{A}$ for different algorithms, whether by rows,
columns, or diagonals, when N = 2 and M = 3. DBT transformations
carried out along diagonals produce a band-transformed matrix with
maximum band density only if N and M are relatively prime. In the
previous section, the DBT transformation carried out along rows was
the applied algorithm.

Figure 5. Regular DBT transformations. (a) Standard DBT algorithms.
(b) Transposed DBT algorithms.

We say that the second group of regular DBT algorithms is transposed
by rows, columns, or diagonals. For this group, the decomposition is
$A_{ij} = A^{LD}_{ij} + A^{U}_{ij}$ for $1 \le i \le N$ and
$1 \le j \le M$. $\bar{A}$ is a lower-band matrix with (NM + 1)-by-NM
blocks. The triangular-block selection order is illustrated in Figure
5b. We can obtain the same result by applying a standard DBT to the
transposed matrix A and then transposing the resulting matrix.
Note that the technique of Partial Row 
Translations 13 is an instance of regular 
DBT transformation if N and M are equal 


Partitioning and 
execution technique 

The DBT transformation rules we have 
considered so far are used to solve 
homogeneous problems, such as matrix- 
by-vector multiplications. The execution 


of a large nonhomogeneous problem (for 
example, a TSE) on a small SAP, which is 
performed by partitioning the problem 
into subproblems that are subsequently 
chained, may be viewed as the execution 
of a band problem in which different types 
of operations on different frames of the 
band are defined. 

Our partitioning and execution tech¬ 
nique consists of three steps. 

(1) Split the problem into sub¬ 
problems that, in accordance with the 
precedences, are executed one after 
another. The submatrices need not be tri¬ 
angular ones. 

(2) Find a band SA for each sub¬ 
problem. Execute directly those sub¬ 
problems whose size fits the size of the 
available array dimensions. 

(3) There are subproblems whose 
dimensions do not allow direct sub¬ 
problem execution. 

(a) If possible, execute them after 
they have been transformed to band sub¬ 
problems. When the subproblems are 
matrix-by-vector or matrix-by-matrix 
types, some of the previously described 
DBT algorithms can be used. Otherwise, 
new transformation rules must be devised. 

(b) When the procedures given in 
Steps 2 and 3a are not applicable, the sub¬ 
problems must be partitioned once more, 
starting at Step 1. 

In summary, all the subproblems are 


either directly executed or are executed 
after some sort of transformation. Our 
technique for finding the best partitioning 
is heuristic, and it must obtain the maxi¬ 
mum possible overlapping between the 
loading and unloading of subproblems. 

Triangular system of 
equations 

In this section our concern is the solution of the triangular system
of equations LX = B. L is a lower triangular matrix with N-by-N
dimensions. X and B are column vectors with N elements. The unknowns
$x_i$ for $1 \le i \le N$ are computed by means of forward
substitutions. The problem is solved on a 1D SAP with w PEs; the SAP
is illustrated in Figure 1c.

Hereafter in this article, we will use the following notation to
denote submatrices of an N-by-M matrix A. For example, block-row
$A_{i,a:b}$ refers to the juxtaposition of b - a + 1 consecutive
blocks of w-by-w elements from the same block-row:

$A_{i,a:b} = (A_{i,a} \; A_{i,a+1} \; \ldots \; A_{i,b})$.

Similarly, $A_{a:b,j}$ is the block-column built up by the vertical
juxtaposition of consecutive blocks:

$A_{a:b,j} = (A_{a,j} \; A_{a+1,j} \; \ldots \; A_{b,j})^{T}$.

The original problem is partitioned and executed by applying the
previously described steps. The first-level partitioning algorithm and
the order of execution of the subproblems (Step 1 as described in the
section entitled “Partitioning and execution technique”) are as
follows:

Compute $X_1$ from $L_{11} X_1 = B_1$    (1)
For i = 2 to N do
    $B_i = B_i - L_{i,1:i-1} X_{1:i-1}$    (2)
    Compute $X_i$ from $L_{ii} X_i = B_i$    (3)
End for

Figure 6. Chained execution of subproblems, and I/O data sequencing
for a triangular system of equations.

The I/O data sequences for executing these subproblems on a linear SAP
with N = 9 and w = 3 are shown in Figure 6. Subproblems (1) and (3)
are of the same type. They consist of the solution of a triangular
system with w unknowns. They can be directly executed by the SAP [1]
(according to Step 2). The direct execution by the array as presented
in Figure 6 implies that $\mathrm{PE}_1$ must perform Operation B (see
Figure 1d), while the rest of the PEs perform Operation A. Subproblems
of Expression (2) consist of matrix-by-vector multiplications with
updating. These are particular cases, with N = 1 and M = (i - 1), of
the matrix-by-vector multiplications presented above (Step 3a). For
them, we use DBT transposed by columns. Now $\mathrm{PE}_1$ has to
perform Operation C while the rest of the PEs must perform Operation
A.

The FSN is a simple multiplexer. Figure 6 shows its Select signal
sequence and the operation that $\mathrm{PE}_1$ must carry out in each
cycle. Address and Select signal generation is simple. The number of
cycles needed to solve the problem is $T = (N^2/w) + N + w - 2$.
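The first-level partitioning of Expressions (1) to (3) is a blocked forward substitution. The following Python sketch shows only the arithmetic of that partitioning, not the systolic data sequencing; the function names are hypothetical, the matrix order is assumed to be a multiple of w, and exact fractions are used so that the result can be checked without rounding.

```python
from fractions import Fraction


def solve_diagonal_block(L, b, lo, hi):
    """Forward substitution on L[lo:hi, lo:hi]: subproblems (1) and (3)."""
    x = []
    for r in range(lo, hi):
        s = b[r - lo] - sum(L[r][lo + c] * x[c] for c in range(r - lo))
        x.append(Fraction(s) / L[r][r])
    return x


def blocked_forward_substitution(L, B, w):
    """Solve L X = B block by block, w unknowns at a time."""
    n = len(L)
    X = []
    for lo in range(0, n, w):
        hi = lo + w
        # Expression (2): update the right-hand side with the unknowns
        # already computed (for the first block this sum is empty).
        b = [B[r] - sum(L[r][c] * X[c] for c in range(lo))
             for r in range(lo, hi)]
        # Expressions (1) and (3): a w-by-w triangular system.
        X.extend(solve_diagonal_block(L, b, lo, hi))
    return X
```

In the article's scheme the update step is the part handed to the DBT-transformed matrix-by-vector product, while the w-by-w solves run directly on the array.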

Matrix-by-matrix 

operation 

We now consider the operation E = FG + H to be performed on the spiral
systolic array processor (SSAP) with w-by-w PEs (Figure 2b); F, G, and
H are, respectively, M-by-N, N-by-P, and M-by-P matrices. First we
split the problem into MP disjoint subproblems according to the
following algorithm (this corresponds to Step 1 in the “Partitioning
and execution technique” section):

For m = 1 to M do
    For p = 1 to P do
        $E_{mp} = F_{m,1:N} G_{1:N,p} + H_{mp}$    (4)
    End for
End for

The MP subproblems (Expression (4)) are solved one after another on
the SSAP. Every subproblem is of the type D = AB + C, where A, B, and
C are, respectively, w-by-N, N-by-w, and w-by-w submatrices. By means
of DBT algorithms, the problem D = AB + C is transformed to a banded
one: $\bar{D} = \bar{A}\bar{B} + \bar{C}$ (Step 3); see Figure 7
(matrices C and $\bar{C}$ have been omitted in the figure). By
applying DBT transposed by columns to matrix A, and DBT by columns to
matrix B, matrices $\bar{A}$ and $\bar{B}$ are obtained. Matrix
$\bar{A}$ is an (N + 1)-by-N block lower-band matrix; $\bar{B}$ is an
N-by-(N + 1) block upper-band matrix with blocks of w-by-w elements.
We now define $\bar{C}$ as a tridiagonal (N + 1)-by-(N + 1) block
matrix in which $\bar{C}_{ij} = C$ for i = j = 1 and
$\bar{C}_{ij} = 0$ otherwise.

Figure 7a shows the triangular-block partitioning for the problem
D = AB + C, and Figure 7b shows the band problem
$\bar{D} = \bar{A}\bar{B} + \bar{C}$. The result $\bar{D}$ is a
tridiagonal (N + 1)-by-(N + 1) block matrix that can be evaluated by
means of a 2D SAP with w-by-w PEs [1]. This evaluation can be
accomplished by the SSAP if all PEs perform Operation A. Matrix D can
be derived from $\bar{D}$ if one adds the $\bar{D}_{i,j}$ blocks for
$1 \le i,j \le N + 1$. This computation may be performed inside the
array by means of the spiral feedback lines without producing any time
overhead. The elements on the main diagonal of $\bar{D}_{i,j}$ are
required 2w cycles later than their appearance in the array output.
The other elements must be delayed w cycles after their appearance in
the output of the North and West ports before being input into the
South and East ports. Hence, we insert 2w equally spaced registers in
the diagonal feedback path and w registers in the other feedbacks. In
this way, we preserve the local communication requirement.

The original problem is solved by chaining MP subproblems, as in the
problem considered above. The selected transformation requires the
insertion of two zero blocks between subproblems; one is an L block
and the other is a U block. In this case, the total computation time
is $(3MPN/w^2) + (3MP/w) + w$. A technique to chain subproblems
without zero blocks already exists [10]. The improved computation time
is $(3MPN/w^2) + 4w$, but control is slightly more complex. For the
SSAP with contraflow, the maximum theoretical utilization of 1/3 is
reached when $NMP \gg w^3$. Utilization reaches 1 if we group the
computations performed by three theoretical PEs into one physical PE.
The original problem can also be solved in an SAP without contraflow,
where it achieves $U \approx 1$ without PE grouping.
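A quick numerical check of these two expressions shows how utilization approaches one third. The sketch below is illustrative: the function names are hypothetical, and utilization is assumed to be the MNP multiply-add operations divided by the PE-cycles available on a w-by-w array, a definition the text does not spell out.

```python
def time_with_zero_blocks(M, N, P, w):
    """Chaining with two zero blocks per subproblem: 3MPN/w^2 + 3MP/w + w."""
    return 3 * M * P * N / w**2 + 3 * M * P / w + w


def time_without_zero_blocks(M, N, P, w):
    """Improved chaining without zero blocks: 3MPN/w^2 + 4w."""
    return 3 * M * P * N / w**2 + 4 * w


def ssap_utilization(M, N, P, w, cycles):
    """Assumed definition: multiply-adds over PE-cycles of a w-by-w array."""
    return (M * N * P) / (w**2 * cycles)


if __name__ == "__main__":
    for n in (6, 60, 600):
        t = time_without_zero_blocks(n, n, n, 3)
        print(n, t, ssap_utilization(n, n, n, 3, t))
```

Grouping three theoretical PEs into one physical PE, as the paragraph suggests, is what would recover the remaining factor of three.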


Triangular matrix 
equations 

A matrix equation is a set of linear systems, all sharing the
coefficient matrix but with different right-hand-side vectors. In this
section, we are concerned with the solution of the lower triangular
matrix equation LY = B to be performed on a w-by-w SSAP; L is an
N-by-N lower triangular matrix, and Y and B are N-by-M matrices. In
the first level of partitioning, the problem is split into M
independent subproblems, each one with Y and B of dimensions N-by-w.
They are executed one after another. Each N-by-w system is partitioned
and executed following the algorithm

Compute $Y_1$ from $L_{11} Y_1 = B_1$    (5)
For i = 2 to N
    $B_i = B_i - L_{i,1:i-1} Y_{1:i-1}$    (6)
    Compute $Y_i$ from $L_{ii} Y_i = B_i$    (7)
End for

(This corresponds to Step 1.)

Figure 7. (a) Triangular-block partitioning of a matrix-by-matrix
problem. (b) The DBT-transformed problem.

Subproblems (5) and (7) are the solution of w-by-w matrix equations.
Consider, for example, Subproblem (5). The matrix equation size does
not allow a direct execution, which corresponds to Step 3. For this
reason, Subproblem (5) is decomposed as follows:

Compute $Y^{DU}_{1}$ from $(L_{11} Y^{DU}_{1})^{DU} = B^{DU}_{1}$    (8)

$B'^{L}_{1} = B^{L}_{1} - (L_{11} Y^{DU}_{1})^{L}$    (9)

Compute $Y^{L}_{1}$ from $L_{11} Y^{L}_{1} = B'^{L}_{1}$    (10)

Subproblems (8) and (9) are computed simultaneously by the SSAP when
the boundary PEs are programmed in such a way that

• $\mathrm{PE}_{1,1}$ performs Operation C,

• $\mathrm{PE}_{1,j}$ for $2 \le j \le w$ performs Operation B, and

• the rest of the PEs perform Operation A.

In order to get the desired result, $L_{11}$ must be input through the
West boundary PEs, and $B_1$ through the South and East PEs. Result
$Y^{DU}_{1}$ is obtained from the North PEs. At the same time, matrix
$B'^{L}_{1}$ is obtained from West PEs. Subproblem (10) is perfectly
chained to the previous one if $B'^{L}_{1}$ is input through the PEs,
and $Y^{L}_{1}$ is obtained from North PEs. Figure 8 shows the
chaining of Subproblems (8), (9), and (10). Subproblem (6) is a matrix
update that is solved by means of DBT. Observe that now
$\mathrm{PE}_{1,j}$ for $1 \le j \le w$ must perform Operation D
(Figure 2c) to change the sign of the previously computed matrix,
$Y_{1:i-1}$. The computation time needed to find the unknown matrix Y
with N-by-M elements is $T = (3MN^2/2w^2) + (3MN/2w) + w$.

Figure 8. Chained execution of and I/O data sequencing for the three
subproblems that result from partitioning of the w-by-w triangular
matrix equation.

LU decomposition 

Given a square matrix A with dimension N, two triangular matrices, L
(lower) and U (upper), must be found such that A = LU. We assume that
A can be decomposed and that each of the diagonal elements of L is
equal to 1. The LU decomposition problem is carried out in the w-by-w
SSAP.

In the first level of partitioning, matrix A is divided into square
blocks, $A_{ij}$; each block has w-by-w elements. Additionally,
matrices L and U are split into their respective blocks; each block
has w rows and columns. The partitioning method for and the execution
sequence of the resulting subproblems are specified in the following
algorithm:

Compute $L_{11}$ and $U_{11}$ from $A_{11} = L_{11} U_{11}$    (11)
For i = 2 to N do
    Compute $U_{1:i-1,i}$ from
        $L_{1:i-1,1:i-1} U_{1:i-1,i} = A_{1:i-1,i}$    (12)
    Compute $L_{i,1:i-1}$ from
        $L_{i,1:i-1} U_{1:i-1,1:i-1} = A_{i,1:i-1}$    (13)
    Compute $A'_{ii} = A_{ii} - L_{i,1:i-1} U_{1:i-1,i}$    (14)
    Compute $L_{ii}$ and $U_{ii}$ from $A'_{ii} = L_{ii} U_{ii}$    (15)
End for

(The algorithm corresponds to Step 1.)

Subproblems (11) and (15) are of the same type: the LU decomposition
of a matrix $A_{ii}(w,w)$. Both can be directly solved by the SSAP
(corresponding to Step 2). Matrix $A_{ii}$ must enter the array
through East and South PEs, and matrices $L_{ii}$ and $U_{ii}$ go out,
respectively, through West and North PEs. The operations performed by
each PE are as follows (Figure 2c): $\mathrm{PE}_{1,1}$ performs
Operation E; $\mathrm{PE}_{1,j}$ for $2 \le j \le w$ perform Operation
B; $\mathrm{PE}_{i,1}$ for $2 \le i \le w$ perform Operation F; and
the rest of the PEs perform Operation A.

Subproblems (12) and (13) are of the same type: one is the
transposition of the other. Each one is the solution of a triangular
system of matrix equations, LX = B; this system of equations has been
addressed previously. The subproblem in Expression (14) is a
matrix-by-matrix multiplication with updating. As noted before, such a
subproblem is solved by applying DBT transformation (Step 3a).

In the chaining of Subproblems (15), (12), (13), and (14), some blocks
of zeroes are required, but they are not necessary between Subproblems
(14) and (15). The total execution time is given by
$T = (N^3/w^2) + (3N^2/2w) + (N/2) + w$.
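The order of Subproblems (11) to (15) corresponds to a bordering form of block LU decomposition. The Python sketch below follows that order on plain nested lists. It is an assumption-laden illustration: the name `block_lu` is hypothetical, no pivoting is performed (the text assumes A can be decomposed), the order of A is assumed to be a multiple of w, and nothing here models the systolic timing.

```python
from fractions import Fraction


def block_lu(A, w):
    """Bordering block LU without pivoting; returns (L, U), L unit lower."""
    n = len(A)
    A = [[Fraction(v) for v in row] for row in A]
    L = [[Fraction(int(r == c)) for c in range(n)] for r in range(n)]
    U = [[Fraction(0)] * n for _ in range(n)]
    for lo in range(0, n, w):
        hi = lo + w
        # (12): block column of U above the diagonal block, by forward
        # substitution with the already-known unit lower triangle.
        for c in range(lo, hi):
            for r in range(lo):
                U[r][c] = A[r][c] - sum(L[r][k] * U[k][c] for k in range(r))
        # (13): block row of L left of the diagonal block (the transposed
        # counterpart of (12)).
        for r in range(lo, hi):
            for c in range(lo):
                s = A[r][c] - sum(L[r][k] * U[k][c] for k in range(c))
                L[r][c] = s / U[c][c]
        # (14): update the diagonal block with the products just formed.
        D = [[A[r][c] - sum(L[r][k] * U[k][c] for k in range(lo))
              for c in range(lo, hi)] for r in range(lo, hi)]
        # (11)/(15): direct LU decomposition of the w-by-w block.
        for r in range(w):
            for c in range(w):
                s = D[r][c] - sum(L[lo + r][lo + k] * U[lo + k][lo + c]
                                  for k in range(min(r, c)))
                if c >= r:
                    U[lo + r][lo + c] = s
                else:
                    L[lo + r][lo + c] = s / U[lo + c][lo + c]
    return L, U
```

For the first block the loops for (12), (13), and the update in (14) are empty, so the iteration reduces to Subproblem (11).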

Other problems 

Given the problems whose solutions we have already demonstrated, it is
easy to solve other problems, such as matrix equations and matrix
inversions. For instance, to compute the inverse of a matrix A, we can
apply the following algorithm [11].

(a) Perform LU decomposition of matrix A.

(b) Solve the lower triangular matrix equation, which is
$L L^{-1} = I$, where $L^{-1}$ is the inverse of the triangular matrix
L, and I is the identity matrix. The regular sparsity of matrices L
and $L^{-1}$ is taken into account in the application of DBT
transformations; such application allows us to optimize computing time
to $T = (N^3/2w^2) + (3N^2/2w) + N + w$.

(c) Solve the upper-triangular matrix equation, which is
$U^{-1} U = I$; this step is analogous to Step (b).

(d) Compute the matrix product $A^{-1} = U^{-1} L^{-1}$. The computing
time, which is improved as in Step (b) by exploiting the triangularity
of the involved matrices, is $T = (N^3/w^2) + (3N^2/2w) + (N/2) + w$.
The resulting matrix-inversion time is
$T = (3N^3/w^2) + (6N^2/w) + (3N) + w$.
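As a check on the quoted total, the four step times can be added term by term. The derivation below assumes that Step (c) costs the same as Step (b), since the text calls it analogous, and it takes the leading term of the LU time to be cubic in N, which is the reading that reproduces the quoted total. Only the terms in N are summed; how the lower-order w terms combine when the steps are chained is not derived here.

```latex
\begin{aligned}
T_{(a)} &= \frac{N^3}{w^2} + \frac{3N^2}{2w} + \frac{N}{2} + w, \\
T_{(b)} = T_{(c)} &= \frac{N^3}{2w^2} + \frac{3N^2}{2w} + N + w, \\
T_{(d)} &= \frac{N^3}{w^2} + \frac{3N^2}{2w} + \frac{N}{2} + w, \\[4pt]
\text{cubic terms:}\quad
  &\left(1 + \tfrac{1}{2} + \tfrac{1}{2} + 1\right)\frac{N^3}{w^2}
   = \frac{3N^3}{w^2}, \\
\text{quadratic terms:}\quad
  &4 \cdot \frac{3N^2}{2w} = \frac{6N^2}{w}, \\
\text{linear terms:}\quad
  &\tfrac{1}{2}N + N + N + \tfrac{1}{2}N = 3N .
\end{aligned}
```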


Pipelined processing 
elements 

In the design of SAPs, whether they have or do not have data
contraflow, it is desirable to use pipelined arithmetic units to
increase throughput and, consequently, to decrease computing time.
When arrays with neither intra-PE cycles nor data contraflow are used,
it is easy to attain an efficient use of two-level pipelining [14]. If
SAPs with data contraflow or intra-PE cycles are considered, two
techniques can be used to take advantage of pipelining when we execute
big problems on small arrays:

(a) grouping the input of adjacent data streams into an individual
physical PE and

(b) overlapping (that is, parallel) execution of independent or
chainable subproblems that result from the original problem’s
decomposition.

The number of grouped streams in (a) and the number of parallel
subproblems executed in (b) are proportional to the number of pipeline
stages in the arithmetic units.
The structures present in contraflow SAPs are diagonals. They are
grouped after the DBT algorithm is applied. This algorithm balances
the load allocated to each processor and reduces loading and unloading
times to a minimum. The required local memory is a function only of
the available SAP dimensions; it is independent of the size of the
problem to be solved.

Subproblems considered in Technique (b) are obtained from a
first-level partition. These subproblems fit one or two dimensions of
the available array. They are transformed in a later step.

Matrix-by-vector problem execution 
that involves the application of a DBT 
transposed by columns algorithm and 
diagonal groupings (Technique (a)) is 
depicted in Figure 9a. The matrix size is 
8-by-8, and the array has w = 2 PEs. Each 
PE includes an adder and a multiplier, 
both of which have three pipelining stages. 
There are 4w registers in the feedback 
path. 


At present, systolic array processors are a viable choice for solving
a wide class of matrix problems. Systolic algorithms have usually been
conceived with the assumption that an unlimited number of processing
elements are available. The necessity for partitioning problems
appears when the algorithm requires more processing elements than
exist in the array. Partitioning is an essential step in mapping
algorithms into both algorithm-specific and versatile systolic array
processors.

Figure 9. (a) I/O data sequencing for matrix-by-matrix execution in a
1D systolic array with pipelined PEs. (b) A pipelined PE.

We have presented a technique for transforming problems with dense
matrices of any size into problems with band matrices whose bandwidth
fits the systolic array dimensions. The transformed problems are
efficiently solved through the execution of systolic algorithms
oriented to band problems. These algorithms require low complexity in
the array. We have proposed transformation algorithms (DBTs) that can
be applied to inner-product-based homogeneous problems such as
matrix-by-vector and matrix-by-matrix multiplication. The original
matrices are partitioned into triangular submatrices. The band of the
transformed-problem matrices is obtained by means of suitable
juxtaposition of triangular submatrices. For other problems, such as
triangular systems of equations and LU decomposition, several
partitioning steps are performed. Subproblems that fit the array size
are executed directly, and the rest are executed after a DBT
transformation. For all the problems, array utilization is maximized,
and consequently computation time is minimized, because of the perfect
overlap between the loading and unloading of triangular subproblems.

We have discussed the solution of matrix-by-vector multiplication and
triangular system of equations problems on a 1D array, and the
solution of matrix-by-matrix multiplication, triangular matrix
equations, and LU decomposition problems on a 2D array. Nevertheless,
all the problems we have discussed can be solved on either 1D or 2D
versatile systolic arrays.

The systolic arrays we have discussed (1D and 2D) have feedback paths
for the communication of partial results. These paths are pipelined,
which preserves local communication. All the PEs must perform
multiplications and additions, and only one of them needs to be able
to perform divisions and changes of sign. The functionality of only a
few PEs must be modified in some cycles ($\mathrm{PE}_1$ in the 1D
array, and the North and West boundary PEs in the 2D array). The
control signals of these PEs can be generated by a centralized unit;
the control signals flow through the boundary PEs in a pipelined
fashion. In this way, locality of control communications is
maintained. Finally, we have addressed the design of data contraflow
arrays with pipelined PEs.

The issue of Computer that you are now reading showcases six research
projects; the feature articles in which they are described all discuss
the realization of systolic arrays from concept to implementation.
This special section provides an overview of seven more projects that
have to do with the design and implementation of systolic arrays. The
first six projects are concerned with the implementation of
signal/image-processing applications. The last project is concerned
with the design of hexagonally connected processing elements in such a
way as to allow the direct execution of dataflow graphs. These seven
projects show a variety of successful but different implementations of
application-dependent systolic arrays.

The collection of research projects 
described in this section is listed below, 
together with the entities that are carrying 
out the projects. These include 

• a GaAs systolic array for adaptive 
null steering beamforming, RCA; 

0018-9162/87/0700-0091 $01.00 © 1987 IEEE


• a systolic processor for adaptive 
beamforming, ESL; 

• a systolic array for digital signal 
processing, Motorola; 

• a systolic array for linear algebraic 
and cellular operations in signal 
processing, Hughes Research; 

• P-NAC, a systolic array for compar¬ 
ing nucleic acid sequences, Brown 
University; 

• Line, a reconfigurable systolic array 
for signal processing, General Elec¬ 
tric; and 

• a hexagonally connected processing 
array for mapping dataflow graphs, 
Technion (Israel) and the University 
of Massachusetts. 

Each summary includes the name and 
address of at least one person who can be 
contacted for further information. The 
summaries also reference published work 
in the field. 

We hope that this special section will serve as the beginning of a
forum for the exchange of ideas.

The Design of a GaAs Systolic Array for an
Adaptive Null Steering Beamforming Controller


Carl E. Hein, Richard M. Zieger, and Joseph A. Urbano
RCA Advanced Technology Labs, Moorestown, NJ 08057
RCA Advanced Technology Labs, Moorestown, NJ 08057 


Our design of the RCA GaAs systolic array beamforming controller
demonstrates the advantages of using a top-down approach for designing
an adaptive radar system. Because of its speed and unique
characteristics, the GaAs technology chosen to implement the array
significantly influenced the design of the internal processor
architecture. The array is configured as an SIMD machine in which each
processing site communicates only with its nearest neighbors. The
array is intended for use in digital radar beamforming for automatic
null steering.

Several systolic arrays have been proposed or constructed for similar
applications. Among them are the RCA Gram-Schmidt Preprocessor [1],
the NOSC (Naval Ocean Systems Center) systolic array [2], and the
Carnegie Mellon Warp machine [3], which was constructed in cooperation
with General Electric and Honeywell.

The selection of an algorithm that could 
adaptively generate the complex beam¬ 
forming coefficient vector was the first 
step in our design process. Of the many 
such algorithms studied, including MSR3, 
Gram-Schmidt, and Givens, the MSR3 
Algorithm was identified as offering the 
best combination of performance, stabil¬ 
ity, and ease of systolic implementation. 
Each algorithm’s performance was judged 
by the algorithm’s speed of convergence, 
the depth of nulls, and the shape of the 
antenna pattern produced. The MSR3 
Algorithm became the design guide for the 
systolic array. 


The MSR3 Algorithm iteratively con¬ 
verges on a solution to the beamforming 
coefficients by minimizing the error in the 
energy received by the antenna. The com¬ 
putational core of each iteration consists 
of a pair of doubly nested DO loops; each 
loop has several arithmetic operations 
inside. Such a sequential code structure 
implies a two-dimensional systolic array of 
processors. 

For a doubly nested loop, a two-dimensional array of processing nodes
can be used in which each row corresponds to an iteration of the outer
loop, and each column corresponds to an iteration of the inner loop.
In the application, the inner-loop index limit is dependent on the
outer-loop index, as shown in the following:


FOR i := 1 TO N DO
    FOR j := 1 TO i DO
        X(i+1, j) := X(i, j) * Y(i, j);
        Y(i, j+1) := X(i, j) * Y(i, j);
FOR i := N DOWNTO 1 DO
    FOR j := i DOWNTO 1 DO
        X(i+1, j) := X(i, j) * Y(i, j);
        Y(i, j+1) := X(i, j) * Y(i, j);


This dependency eliminates half of the 
node positions and results in a triangular 
array. 

A doubly nested loop can be implemented as a linear array in which
each node performs every iteration of the inner loop for one iteration
of the outer loop. However, implementing the outer loop this way would
result in an unbalanced processing load because, for each invocation,
the i = 1 processing node has only one iteration to perform, whereas
the i = N processing node has N iterations.
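The shape of the iteration space and the resulting imbalance can be made concrete with a short count. This is a sketch only: the function names are hypothetical, and it enumerates index pairs of the first loop nest without modeling the X and Y updates.

```python
def triangular_positions(N):
    """Index pairs (i, j) visited by the first loop nest, with 1 <= j <= i <= N."""
    return [(i, j) for i in range(1, N + 1) for j in range(1, i + 1)]


def linear_array_load(N):
    """Inner-loop iterations handled by node i if each node owns one outer iteration."""
    load = {}
    for i, _ in triangular_positions(N):
        load[i] = load.get(i, 0) + 1
    return load


if __name__ == "__main__":
    N = 8
    used = len(triangular_positions(N))
    print(used, "of", N * N, "node positions are used")  # N(N+1)/2 of N^2
    print(linear_array_load(N))                          # node i performs i iterations
```

For large N the triangular array uses close to half of the N-by-N positions, and in the linear arrangement the busiest node does N times the work of the least busy one.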

The simulation of the algorithm does 
not contain assumptions or information 
about the physical structure of the com¬ 
puting elements. As a result, the simula¬ 
tion computes the algorithm by 
performing each operation in sequence. 
To assist in determining the order of data 
transfers and operations in the full array 
of processors running under the steady- 
state condition, we modified the original 
simulation of the algorithm into a system 
simulation that recognized the parallel exe¬ 
cution of the array and the distribution of 
variables in space and time. 

From the system simulation, the 
sequence and locations of the data trans¬ 
fers between nodes were easily traceable. 
Of the algorithm’s two pairs of DO loops, 
the second pair has a deleterious effect on 
the performance of the array because it is 
indexed in reverse order from the first. In 
the systolic array, dataflow in the first DO- 
loop pair flows from left to right and from 
top to bottom, but in the second DO-loop 
pair, data flows from right to left and bot¬ 
tom to top. The implication is that instead 
of the data from succeeding invocations 
propagating in continuous waves across 
the array, one wave must propagate across 
the array in one direction, then another 
wave must propagate across the array in 
the opposite direction. This destroys the
ability to pipeline the operation efficiently
and results in the majority of nodes being
idle most of the time.

We chose a unique combination of system and nodal architecture; this
architecture produces continuous, pipelined dataflow by providing
storage locations at each node that allow both waves to run
simultaneously in opposite directions on one triangular array of
processors. It does this by creating a pipeline of operations that has
a depth equal to twice the number of processors along the diagonal of
the array. The storage registers accommodate the numerous stages of
the pipeline by saving the critical values from each stage. In this
way, the dataflow of the algorithm is preserved in space because it
skews operations over time.

The intricate data and time dependen¬ 
cies within the systolic array are so compli¬ 
cated that it was virtually impossible to 
continue the design without simulation at 
the system level. The system simulation 
was useful in determining bottlenecks and 
critical paths at the system level; it 
employed information about the length of 
time required, and about the data 
produced and consumed, by each opera¬ 
tion. Initially, these quantities were esti¬ 
mated. As the design progressed and more 
information became available, the simu¬ 
lation determined more precisely the time 
constraints for each operation and data 
transfer. 

Simulation demonstrated that under 
steady-state processing, the processors in 
the array run a single sequence of opera¬ 
tions in three phases with a 100 percent 
utilization of the interior processors. 

The system architecture of a systolic 
array can be made fault tolerant if a nodal 
interconnection scheme is provided so that 
a failed processing node can be bypassed 
and processing nodes in spare rows and 
columns can be used. 

The system architecture set certain con¬ 
straints on the internal architecture of the 
nodal processor and dictated the proces¬ 
sor’s functionality. Communication ports 
going in and out are required on each of 
the four sides of the nodal processor. Both 
the simplicity of the operations and the 
construction advantages suggested that all 
circuitry for a processing node can be par¬ 
titioned onto one chip. The data ports were 
serialized by nibbles to lessen their num¬ 
ber. The algorithm in combination with 


the system architecture dictated the nodal 
processor’s register set, which consists of 
four general registers used for holding crit¬ 
ical values of the algorithm between oper¬ 
ations. 

A throughput requirement on the order of 5 µs per coefficient update
indicated the need for clock rates in the 100-MHz range. These clock
rates are currently reachable with GaAs parts. Unfortunately, GaAs
fabrication capabilities are not as advanced as those of silicon.
Consequently, the internal architecture of the nodal processor
reflects many trade-offs in circuit complexity. Since the processor is
being custom built to perform MSR3-like algorithms, only the minimum
set of data paths is provided, and control is effected through logic
driven by a state counter.



Circuit complexity was further reduced by
25 percent through the selection of a special
pseudo-floating-point number format that 
offers the equivalent range of 14-bit fixed- 
point representation but uses only 10 bits. 
This format was developed to meet range 
and accuracy requirements of the beam¬ 
forming application. This format reduces 
the interchip communication volume by 29 
percent. 
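
The 29 percent figure is consistent with the two word lengths quoted above, assuming that interchip communication volume scales with the number of bits per operand:

```latex
\[
\frac{14 - 10}{14} = \frac{4}{14} \approx 0.286 \approx 29\%
\]
```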

Gallium arsenide technology offers 
important advantages for real-time signal¬ 
processing applications. GaAs has a high 
speed/power factor, is radiation hard (in 
terms of total dose), and temperature 
tolerant. Although GaAs has problems 
that arise from its limited radiation hard¬ 
ness, high wafer density dislocations 
(which result in low yield), high cost, small 
noise margin, and the lack of high-speed 
testing equipment available for it, most 
GaAs problems are expected to be tem¬ 
porary. 

For the GaAs systolic array, RCA has 
designed and fabricated a GaAs 32-bit 
ALU test chip and a 500M-bps Manchester 
Encoder. A GaAs 200-MIPS, RISC, 8-bit 


microprocessor and a Manchester 
Decoder were also designed. The GaAs 
chips were fabricated by Triquint Corp. of 
Beaverton, Ore., by means of a low-power 
process (a typical gate dissipates about 780
µW) that incorporates 1-micron enhancement/depletion MESFET technology in
both gate-array and standard-cell options. 

The standard cells used in the systolic 
array are limited to high-yield 1- to 5-input 
NOR gates. The systolic array includes 
single-chip, standard-cell, GaAs process¬ 
ing nodes. Each processing-node chip has 
about 2100 of the small NOR gates. 

Working from specifications of data- 
transfer rate, coefficient-update rate, and 
GaAs technology limitations (as regards 
gates and pins), we determined the mini¬ 
mum clock rate, the number and types of 
chips, and the number of handshaking 
lines required. 

In the array, only one processor design 
was used. The clock rate is 120 MHz; the 
rate of data transfer between processors is 
24 MHz; the number of pins is 42. The system
updates the coefficients for the multiple
beams every five µs.
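
As a rough check on these figures, and taking the update period to be five microseconds, the clock budget per coefficient update and the ratio of processor clock to transfer rate work out as follows (a back-of-the-envelope sketch, not a statement about the actual control sequence):

```latex
\[
120\ \text{MHz} \times 5\ \mu\text{s} = 600\ \text{clock cycles per coefficient update},
\qquad
\frac{120\ \text{MHz}}{24\ \text{MHz}} = 5\ \text{clock cycles per data transfer}
\]
```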

Data is transmitted serially over eight 
lines (that is, over the two unidirectional 
lines on each of four sides of a processing 
node in the array). A RISC-like philoso¬ 
phy of hardware-to-software trade-offs 
was used to meet the low-gate-count 
restriction. 

The efficient design procedure reported 
here has resulted in the design of a realiz¬ 
able, special-purpose, hardware-efficient 
systolic array with real-time performance 
unmatchable by any realizable uniproces¬ 
sor system. 
A Systolic Signal Processor for
Signal-Processing Applications 


Douglas A. Kandle 

ESL, Inc., 495 Java Dr., MS 302, PO Box 3510, Sunnyvale, CA 95088 


As part of its Adaptive Beamformer project, ESL, Inc., has
completed a 350-MFLOP systolic
processor that performs adaptive
beamforming for acoustic signal¬ 
processing applications. The adaptive 
beamformer itself implements the Mini¬ 
mum Variance Distortionless Response 
Algorithm. 1 A frequency-domain adap¬ 
tive beamformer, it was developed on a 
systolic architecture processor imple¬ 
mented with custom VLSI chips. ESL’s 
Systolic Adaptive Beamformer project is 
intended to demonstrate the applicability 
of systolic processing techniques to acous¬ 
tic signal processing. 

The system consists of a special-purpose 
systolic processor that attaches to a sensor 
array (100 channels, each of which 
produces 12-bit data samples at a rate of 
3000 samples/sensor/sec) and to a general-purpose
host computer (the Digital Equipment
Corp. VAX 11/750). Figure 1 is a
block diagram of the system. 

The processor was built to perform 
narrow-band, passive, sonar signal 
processing. The basic signal processing
requires that the inputs from many hydro¬ 
phones be linearly combined in the fre¬ 


quency domain so as to produce a 
directional acoustic receiver with very high 
gain. A sensor array has a complex spatial- 
response pattern. It will “hear” sounds 
from every direction; these have varying 
degrees of amplification (or attenuation). 
The goal of an adaptive beamformer is to 
combine the sensor outputs in such a way 
that the array will hear sounds coming 
from the desired direction but will mini¬ 
mize the total energy (from noise and inter¬ 
fering sources) received from all other 
directions. 

A complete description of the adaptive 
beamforming process is beyond the scope 
of this article. However, from a 
computing-architecture point of view, the 
process can be broken down into three lin¬ 
ear algebra problems that must be solved. 

(1) Factor a matrix. That is, compute a matrix $U_{new}$ defined by $U_{new}^{*} U_{new} = U_{old}^{*} U_{old} + zz^{*}$, where $U_{old}$ is an $n \times n$ upper triangular matrix with real, positive values on the diagonal, and $z$ is a vector in $C^n$.

(2) Solve linear systems of the form $Ux = d$, where $x$ and $d$ are vectors in $C^n$ and $U$ is as defined above.

(3) Compute inner products of the form $w^{*}z$, where $w$ and $z$ are in $C^n$.
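
Problems (2) and (3) can be illustrated with a minimal sketch in plain Python. The function names are hypothetical, the code runs sequentially rather than systolically, and the factorization update of problem (1) is left out; it only shows the arithmetic that the hardware must carry out on complex operands.

```python
def back_substitute(U, d):
    """Solve U x = d for an upper triangular complex matrix U (list of rows)."""
    n = len(d)
    x = [0j] * n
    for i in range(n - 1, -1, -1):
        # Subtract the contribution of the unknowns that are already solved.
        s = sum(U[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (d[i] - s) / U[i][i]
    return x


def inner_product(w, z):
    """Compute w* z, conjugating the first vector."""
    return sum(wi.conjugate() * zi for wi, zi in zip(w, z))


U = [[2, 1j],
     [0, 1]]
d = [2 + 1j, 1]
x = back_substitute(U, d)
ip = inner_product([1j, 2], [1j, 1])
```

Both routines reduce to chains of multiply-and-add steps on complex values, which is why a single multiplier/adder cell can serve all three problem types.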


ESL has developed a custom VLSI chip 
that is a systolic cell. The systolic-cell chip 
is used to solve all of the above problem 
types. The systolic-cell chip is a high¬ 
speed, floating-point multiplier/adder 
designed to support complex arithmetic. 
Chip operands are 32-bit, IEEE-format, 
floating-point numbers. Successive oper¬ 
ands are treated as a sequence of real, then 
imaginary parts of complex numbers. The 
chip has three input-data ports (A, B, and 
C) and two output-data ports (BD and 
CD). All five data ports are 1 byte (8 bits) 
wide and require eight consecutive 1-byte 
data transfers to complete a complex oper¬ 
and transfer. Data moves at a rate of 1 byte 
every 100 ns, thus requiring 800 ns to trans¬ 
fer a complex value. 

The principal operation performed by 
the systolic-cell chip is to accept three input 
operands (A, B, and C) and produce two 
results (BD and CD) such that 

BD ← B

CD ← A × B + C

The signals BD and CD are delayed seven 
cycles, which is the pipe depth of the chip. 
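
The behavior described above can be mimicked with a small software model. This is only a sketch under stated assumptions: the class name and the use of a queue to stand for the pipeline are illustrative, one call to `step` stands for one complete operand cycle, and the byte-serial ports are not modeled.

```python
from collections import deque

PIPE_DEPTH = 7  # pipe depth of the chip, as given in the text


class SystolicCell:
    """Toy model: BD <- B and CD <- A*B + C, each delayed by PIPE_DEPTH cycles."""

    def __init__(self):
        # The pipeline starts empty; None marks a stage with no valid result yet.
        self._pipe = deque([None] * PIPE_DEPTH)

    def step(self, a, b, c):
        """Accept one (A, B, C) triple and return the (BD, CD) pair leaving the pipe."""
        self._pipe.append((b, a * b + c))
        return self._pipe.popleft()


cell = SystolicCell()
outputs = [cell.step(1 + 2j, 3 - 1j, 0.5j)]
outputs += [cell.step(0j, 0j, 0j) for _ in range(PIPE_DEPTH)]
# outputs[0:7] are None; outputs[7] is the result for the first triple.
```

Python's built-in complex type handles the real and imaginary parts together, whereas the chip receives them as successive 32-bit operands.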

In addition to the input-data ports, there are OC, FC, and BC ports. These ports determine the operation performed and effectively constitute an “instruction.” Inputs CA, CB, and CC selectively negate and/or conjugate the A, B, and C inputs.

Figure 1. A block diagram of ESL, Inc.’s 350-MFLOP systolic processor system.

The Systolic Adaptive Beamformer project has shown that a systolic-architecture processor can achieve very high processing throughput even in a very small physical space. Single-board systolic processors with a computational throughput of up to 120 MFLOPS were built. The 350-MFLOP system has achieved very high processing utilization rates (in excess of 90 percent of available computational power). To achieve its very high sustained-throughput rates, the system is tailored to a specific set of linear algebra problems. While this tailoring is very restrictive from a mathematical viewpoint, the class of problems that the systolic processors can solve is quite large. In general, the system can solve any linear least squares problem of order 50 or less. The upper bound of 50 is imposed by the memory sizes of the systolic processors.

ESL is currently building a second-generation systolic processor that will be able to perform at a substantially higher throughput. The original design has been improved by adding both more systolic cells in each processor and more flexible microcode addressing modes.

An Advanced DSP
Systolic-Array Architecture

Steven B. Leeland

Motorola, Inc., Government Electronics Group, 2501 S. Price Rd., Chandler, AZ 85248-2899

Motorola has been working on an advanced digital signal processing (DSP) systolic-array architecture. Progress to date
includes a completed architecture design, 
a design for processing-element logic, 
architecture and algorithm simulations, 
and partially completed logic simulations. 
The primary goals for this architecture are 
• to increase the processing perform¬ 
ance by a factor of 16 over that of an 
existing Motorola DSP system (a cus¬ 
tom VLSI) that executes demodula¬ 
tion algorithms, 

• to perform 32-bit floating-point 
arithmetic for applications requiring 
great precision, such as a 64K-point
Fast Fourier Transform, and 
• to reduce the effort needed to develop 
software for the architecture. 
Processing-element size is a critical fac¬ 
tor in a systolic-array architecture. Small 
processing elements allow more elements 


per chip, which in turn reduces overall sys¬ 
tem size. To reduce processing element 
size, Motorola has based the architecture 
on processing elements with 32-bit 
floating-point serial processors; the 
processing-element structure is shown in 
Figure 1. Serial processors require an order 
of magnitude less logic than parallel ver¬ 
sions. The processors for the new architec¬ 
ture are serial, and they require 50 clock 
cycles to perform each operation. Chip 
operating speed should exceed 20 MHz,
since the longest logic path is the one used 
by an 8-bit exponent adder. Therefore, a 
new data sample can be introduced into 
the array every 2.5 µs, which exceeds the
rate for introducing new data samples in 
the existing Motorola DSP system by a 
factor of 20. 
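
The sample interval follows from the cycle count and the clock rate quoted above, taking 20 MHz as the operating frequency and microseconds as the unit of the interval:

```latex
\[
\frac{50\ \text{clock cycles}}{20\ \text{MHz}} = 50 \times 50\ \text{ns} = 2.5\ \mu\text{s per operation}
\]
```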

Because the floating-point processing 
consumes only 50 clock cycles, it is possi¬ 
ble to improve intercell communications 
over those in the existing DSP. During 
these normally unusable times (the clock
cycles), data is allowed to flow through 
cells transparently. A network of East, 
West, North, and South buses allows data 
to travel as far as 50 cells away. Many 
systolic-array architectures only allow 
direct data transfer between nearest- 
neighbor processing elements. 1,2 

DSP-software development is often a 
time-consuming and arduous task. Moto¬ 
rola’s new systolic-array architecture has 
already reduced this effort. Algorithms are 
implemented directly in this architecture. 
Every cell is initialized at least once after 
each power-up. From that point (power- 
up) on, the cell performs the same opera¬ 
tion every processing cycle. It also gets its 
input data from the output of an assigned 
“neighbor” cell. This pre-set operation 
makes it possible to assign the function of 
each cell directly from a signal-flow dia¬ 
gram. The new systolic-array architecture 
essentially eliminates sequential software. 
The software-development tool is written 
in Pascal for an IBM PC. The basic con¬ 
cept for the development tool is a spread¬ 
sheet processor. Algorithms are entered, 
copied, and replicated just as in any 
spreadsheet tool. As an added feature, the 
algorithm can be tested, simulated, and 
debugged with the same tool. 

Architecture simulations have recon¬ 
firmed that this systolic design performs 
best on algorithms with strong locality of 
signal flow. 3,4 The date when the architec¬ 
ture will be usable is drawing nearer as 
logic simulations approach completion. 
The Systolic/Cellular System
for Signal Processing 


J. Greg Nash, K. Wojtek Przytula, and S. Hansen 

Hughes Research Laboratories, 3011 Malibu Canyon Rd., Malibu, CA 90265


We designed the Systolic/Cellular
System for large classes of
linear algebraic and cellular 
operations that are used in signal process¬ 
ing. It consists of a host and a programma¬ 
ble coprocessor. The coprocessor includes 
an array of 16 × 16 mesh-connected
processors, dual-port array memory, and 


a controller with a separate program mem¬ 
ory (see Figure 1). The input data and the 
programs for the coprocessor are loaded 
from the host into the array memory and 
the program memory, respectively. 

Figure 1. Block diagram of the Systolic/Cellular System.

Figure 2. Organization of the processors.

The system can operate in two modes: cellular and systolic. In the cellular mode of operation, the input data are first loaded into the processor array from the
array memory and are then processed in 
unison. After the computation is com¬ 
pleted, the results are unloaded from the 
processors and the next computation cycle 
begins. We used this mode to implement 
point and window operations (thresholding, convolution, and so on), fast transforms (Haar, Hadamard, and 1D and 2D FFT, each in radix-2 and radix-4 versions), sorting, and some of the matrix
operations (addition, transposition). 

In systolic computations, the data 
blocks leave the array memory in a row- 
by-row fashion through one port, then 
“flow” through the array, during which 
time they undergo appropriate processing. 
The results return to the array memory 
through the other port. The systolic oper¬ 
ations are implemented by means of only 
two algorithms, both generic: the modified 
Faddeeva Algorithm 1 and the algorithm 
of Frank Luk. 2 We used the Faddeeva 
Algorithm for implementation of matrix 
operations (inversion, QR factorization), 
solution of systems of linear equations 
(including dense and banded systems 
larger than the array), and least squares 
problems (overdetermined, underdeter¬ 
mined, generalized). We used the algo¬ 
rithm of Frank Luk for matrix, 
eigenvalue, and singular-value decompo¬ 
sitions. All the operations listed above are 
used frequently, not only in signal process¬ 
ing but also in many other applications— 
for example, in robotics. 
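
To show why a single generic algorithm can cover inversion and linear systems alike, here is a minimal sequential sketch of the basic Faddeeva scheme: the compound matrix [[A, B], [-C, D]] is reduced by row operations until the -C block is annulled, at which point the lower right block holds D + C·A⁻¹·B. The sketch uses plain Gaussian elimination with exact fractions, which is an assumption made for clarity; it is not the modified, systolic form used on the array, and the function name is hypothetical.

```python
from fractions import Fraction


def faddeeva(A, B, C, D):
    """Return D + C * inverse(A) * B by annulling -C in [[A, B], [-C, D]]."""
    n, p = len(A), len(C)
    # Build the compound matrix with exact rational entries.
    M = [[Fraction(v) for v in A[i]] + [Fraction(v) for v in B[i]] for i in range(n)]
    M += [[-Fraction(v) for v in C[i]] + [Fraction(v) for v in D[i]] for i in range(p)]
    for k in range(n):
        # Pick a nonzero pivot among the rows of the A block only.
        pivot = next((r for r in range(k, n) if M[r][k] != 0), None)
        if pivot is None:
            raise ValueError("A is singular")
        M[k], M[pivot] = M[pivot], M[k]
        # Eliminate column k from every row below, including the -C rows.
        for r in range(k + 1, n + p):
            f = M[r][k] / M[k][k]
            if f:
                M[r] = [x - f * y for x, y in zip(M[r], M[k])]
    return [row[n:] for row in M[n:]]


identity = [[1, 0], [0, 1]]
# Linear system A x = b: choose B = b, C = I, D = 0.
x = faddeeva([[2, 1], [1, 3]], [[3], [5]], identity, [[0], [0]])
# Matrix inverse: choose B = I, C = I, D = 0.
inv = faddeeva([[0, 1], [2, 0]], identity, identity, [[0, 0], [0, 0]])
```

Choosing different B, C, and D blocks turns the same elimination into a solver, an inverter, or a multiply-and-add, which is the property that makes the algorithm attractive for a fixed array.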

The architecture of the coprocessor is 
tailored to the two generic systolic 
algorithms, which have a similar underly¬ 
ing structure, and to selected cellular 
algorithms, such as fast transforms. 3 The 
processing is performed under the control 


of a simple SIMD controller. The leftmost 
column of the array (which consists of 
boundary processors) can be programmed 
to perform different operations from the 
rest of the processors (which are the inter¬ 
nal processors). All processors within these 
two parts of the array operate in unison. 
However, it is possible to disable selected 
processors for any computation step. 

A special, custom VLSI processor has 
been designed for the array (see Figure 2). 
It is a 32-bit, fixed-point, dual-bus processor
with a bit-slice structure. There are
seven functional units in it: two mul¬ 
tipliers, two adders, a divider, and a com¬ 
parator. 4 Each processor also contains 24 
memory registers. The arithmetic 
algorithms of the functional units have the 
necessary control embedded in hardware, 
which adds considerable speed over that 
attained by means of a microcoded 
approach; also, multiple functional units 
can perform computations concurrently. 
Our choice of the functional units was dic¬ 
tated by the target applications. For exam¬ 
ple, FFT and Faddeeva algorithms require 
four multiplications per computation 


stage; therefore, we provided more than 
one multiplier. The processors have four 
multiplexed 16-bit I/O ports for intercon¬ 
nection with their four nearest neighbors. 
The control over the ports is separate from 
the control of the data path so that com¬ 
munication between the processing ele¬ 
ments can take place during the 
computational activity. 

The instruction set of the array consists
of about 30 powerful, wide instructions. 
The machine instruction, which is 112 bits 
wide, has two separate fields for operation 
codes for boundary and internal proces¬ 
sors; additionally, these two fields are sub¬ 
divided into I/O and computation 
subfields. A third field is used for system 
codes, and the fourth for a mask that 
determines which processors are to be dis¬ 
abled in a given cycle. The programming 
environment of the array consists of an 
assembly language and a library of 
preprogrammed macros. 

Maximum system performance is in the 
neighborhood of 450 MOPS. Data- 
transfer rate between the dual-port mem¬ 
ory and the array is approximately 340M 
bytes/sec. A simulation of typical bench¬ 
mark algorithms indicates that a solution 
of a linear system of 100 equations requires 
22 ms, and a 1024-point, complex (32-bit) 
FFT is computed in 260 µs. The prototype
system is in the final stages of integration 
and will be tested this summer. 
P-NAC: A Systolic Array for Comparing
Nucleic Acid Sequences 


Daniel P. Lopresti 

Dept. of Computer Science, Brown University, Providence, RI 02912


The Princeton Nucleic Acid Comparator
(P-NAC) is a linear systolic
array for comparing DNA
sequences. The architecture is a parallel 
realization of a standard dynamic pro¬ 
gramming algorithm. Benchmark timings 
of a VLSI implementation confirm that, 
for its dedicated application, P-NAC is 
two orders of magnitude faster than cur¬ 


rent minicomputers. 1 Experience with the 
prototype is shaping the design of a 
second-generation device, to be known as 
the Brown Nucleic Acid Comparator (B- 
NAC), that will be algorithmically flexible 
and more tolerant of fabrication faults. 

The primary structure of DNA can be 
specified as a string of characters chosen 
from the alphabet {A, C, G, T}. While 


there exist a number of different metrics 
for comparing strings in general and DNA 
sequences in particular, it is not yet under¬ 
stood which are biologically valid. 
Nevertheless, an intuitively satisfying mea¬ 
sure assumes that DNA “evolves” by 
undergoing a series of three elemental 
steps: the deletion of a single “character” 
(nucleotide), the insertion of a single character, and the substitution of one character for another. The similarity between two DNA sequences, $S = s_1 s_2 \ldots s_m$ (the source) and $T = t_1 t_2 \ldots t_n$ (the target), can be quantified by assessing a charge for each of these steps and then finding the least expensive transformation of S into T.

$d_{0,0} = 0$

$d_{i,0} = d_{i-1,0} + c_{del}(s_i), \quad 1 \le i \le m$

$d_{0,j} = d_{0,j-1} + c_{ins}(t_j), \quad 1 \le j \le n$

Figure 1. The initial conditions for determining evolutionary distance.

Figure 2. The recurrence relation for determining evolutionary distance.

The cost associated with such a transfor¬ 
mation is designated the evolutionary dis¬ 
tance; it can be determined by making use 
of a well-known dynamic programming 
approach. 2 Let $d_{i,j}$ be the distance between the subsequences $s_1 s_2 \ldots s_i$ and $t_1 t_2 \ldots t_j$. The initial conditions are shown in Figure 1, and the recurrence relation is shown in Figure 2. The algorithm builds an m × n table in which the desired result, $d_{m,n}$, appears in the lower right corner.
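
A minimal sequential sketch of this dynamic-programming approach is given below. The recurrence in the inner loop is the standard minimum over deletion, insertion, and substitution, which is an assumption about the content of Figure 2; the function name is hypothetical, and the default costs are the ones the article gives for the P-NAC prototype (1 for a deletion or insertion, 0 or 2 for a substitution).

```python
def evolutionary_distance(source, target,
                          c_del=lambda s: 1,
                          c_ins=lambda t: 1,
                          c_sub=lambda s, t: 0 if s == t else 2):
    """Return d[m][n], the cheapest cost of transforming source into target."""
    m, n = len(source), len(target)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    # Initial conditions: first column (deletions only), first row (insertions only).
    for i in range(1, m + 1):
        d[i][0] = d[i - 1][0] + c_del(source[i - 1])
    for j in range(1, n + 1):
        d[0][j] = d[0][j - 1] + c_ins(target[j - 1])
    # Recurrence: each entry depends on its upper, left, and upper-left neighbors.
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d[i][j] = min(
                d[i - 1][j] + c_del(source[i - 1]),
                d[i][j - 1] + c_ins(target[j - 1]),
                d[i - 1][j - 1] + c_sub(source[i - 1], target[j - 1]),
            )
    return d[m][n]


same = evolutionary_distance("ACGT", "ACGT")
one_deletion = evolutionary_distance("ACG", "AG")
```

The table in this sketch carries an extra row and column for the empty prefixes, so it has (m + 1) × (n + 1) entries.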

Fortunately, there is tremendous poten¬ 
tial for parallelism in the construction of 
such a distance table; all values $d_{i,j}$ can be calculated simultaneously for a given k = (i + j). Mapping this recurrence onto a linear systolic array is a straightforward process,
although there are several possible
dataflows that can yield alternative 
architectures. One such design is shown in 
Figure 3. The source and target sequences 
form two data streams that shift in from 
the left and right, respectively; one charac¬ 
ter is input every clock cycle. Initializing 
distances taken from the first row and 
column of the dynamic-programming 
table travel with the characters. As these 
values pass through the array, they are 
transformed into distances from the last 
column and row including $d_{m,n}$. A processor
performs one inner-loop step of the
recurrence each time it receives a source 
and target character. 
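
The parallelism can be made visible in software by filling the table one wavefront at a time. In the sketch below, every entry of a wavefront is computed only from entries of earlier wavefronts, so all entries of one wavefront could be assigned to different processors. The grouping by i + j, the function name, and the hardwired prototype costs (1, 1, and 0 or 2) are choices made for this illustration; the code does not model the actual data streams of the array.

```python
def distance_by_wavefronts(source, target):
    """Return (distance, front_sizes), filling the table one antidiagonal at a time."""
    m, n = len(source), len(target)
    d = {(0, 0): 0}
    front_sizes = []
    for k in range(1, m + n + 1):
        front = {}
        # All (i, j) with i + j == k; none of them reads another entry of this front.
        for i in range(max(0, k - n), min(m, k) + 1):
            j = k - i
            candidates = []
            if i > 0:
                candidates.append(d[i - 1, j] + 1)   # deletion
            if j > 0:
                candidates.append(d[i, j - 1] + 1)   # insertion
            if i > 0 and j > 0:
                sub = 0 if source[i - 1] == target[j - 1] else 2
                candidates.append(d[i - 1, j - 1] + sub)
            front[i, j] = min(candidates)
        d.update(front)
        front_sizes.append(len(front))
    return d[m, n], front_sizes


distance, sizes = distance_by_wavefronts("ACG", "AG")
```

The table is completed in m + n wavefronts instead of m × n sequential steps, and the size of the largest wavefront indicates how many processors can be kept busy at once.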

P-NAC is an nMOS implementation of 
the systolic array just described. To sim¬ 
plify design of the prototype, the general 
algorithm was restricted by hardwiring the 
evolutionary costs so that $c_{del}(s) = c_{ins}(t) = 1$, and $c_{sub}(s,t) = 0$ if $s = t$ and $c_{sub}(s,t) = 2$ if $s \ne t$. Each chip contains 30
processing elements, which are snaked in 
three rows of 10 processors. A bypass 
option, which shortens the path through 
a chip to 10 processors, was included so 
that comparisons of short sequences 
would not needlessly pass through the 
entire length of the array. Housed in a 


40-pin, 4.6-mm x 6.8-mm standard 
frame, the device was fabricated using a 
4-micron process by the MOS Implemen¬ 
tation Service (MOSIS). 

By itself, P-NAC is next to useless. A 
Multibus support board forms the link 
between the systolic array on the hardware 
level and the host computer, a Sun 2 work¬ 
station, on the software level. A number 
of benchmarks have been run to test func¬ 
tionality and speed; P-NAC performs 125 
times faster than a DEC VAX 11/785 
when the VAX uses the same dynamic pro¬ 
gramming algorithm. 1 




Two important issues confronting the 
designers of VLSI processor arrays are 
fault tolerance and I/O bandwidth. 
Because P-NAC is a “conservative” chip 
(requiring just over 6000 transistors), the 
former issue is not a major concern; yield 
has generally been near 100 percent. 
Nevertheless, P-NAC did suffer a mask 
defect on its first fabrication run. For¬ 
tunately, the bypass option permitted a 
route around the bad location, which left 
a usable, albeit a shortened, array. This 
simple technique for introducing fault 
tolerance, although not originally intended 
to serve that purpose, will be improved for 
use on future versions. 

More serious is the problem of I/O 
bandwidth. While P-NAC is simpler than 
most parallel architectures proposed in the 
literature, its appetite for data is insatia¬ 
ble; the chips are 10 times faster than the 
support board can drive them. For now, 
this additional power remains untapped. 

Despite P-NAC’s speed advantage, the 
prototype implementation suffers because 
it lacks the flexibility of a software solu¬ 


tion; its notion of similarity is 
preprogrammed and limited. Clearly, 
evolutionary costs must be permitted to 
range over some small set of integer values. 
In addition, it is not necessary for two 
sequences to be similar in their entirety for 
their comparison to yield an interesting 
result. One DNA sequence may resemble 
a subsequence of another (for example, 
one gene contained in a larger chromo¬ 
some), or two DNA strings may share a 
highly similar subsequence (for example, 
a common gene). Given two sequences S 
and T, three questions naturally arise: Are 
S and T similar? Is there a subsequence S' 
of S that is similar to T? Is there a subsequence
S' of S that is similar to a subsequence
T' of T?

It is not surprising that these variations 
on the basic sequence-comparison prob¬ 
lem can be solved by means of dynamic 
programming algorithms similar to the 
one just described. A single programma¬ 
ble array could be built to handle any or all 
of the aforementioned cases. Work is now 
commencing on such a second-generation 
device, the B-NAC. 
Integrating Systolic Arrays
Into a Supersystem 


Wen-Tai Lin, Chi-Yuan Chin, and Chung-Yih Ho
Recent studies of supercomputers
show that the low processor utilization
rate displayed by conventional
vector machines is mainly the result
of traffic between the system memory and 
processors. Although small-grain architec¬ 
tures (such as systolic arrays and VLSI 
arrays in general) are designed to ease the 
memory-processor traffic, they suffer 
most in that they lack the flexibility to 
accommodate a wide variety of applica¬ 


tion algorithms. In this article, we propose 
a reconfigurable 2D processor array that 
is a trade-off between the large-grain vec¬ 
tor machines and the small-grain, pure sys¬ 
tolic arrays. The advantages are fourfold. 

• First, when programmable processing 
elements (PEs) are used for it, many exist¬ 
ing systolic algorithms can be directly 
mapped into the 2D array. 

• Second, for large systolic problems 
and in cases where algorithm decomposi¬ 


tion is difficult, it may provide alternative 
computation modes, such as semisystolic 
computation. 

• Third, when the array is part of the 
Linc (Link and INterconnection Chip) 1
VLSI interconnection chip, the array can 
be dynamically reconfigured to accommodate
a wide range of computational structures.
The mesh-connected Linc array can
also be used to simulate many interconnec¬ 
tion networks that are used for nonsystolic 
processing. 

• Lastly, the modular 2D array can be 
degenerated into a ID array to provide 
fault tolerance. 
Figure 1. The system architecture of a 4 × 4 Linc array.


The reconfigurable architecture is illustrated by the 4 × 4 Linc array of Figure 1. Each computation node is composed of four Linc chips (which form eight channels, each with 16-bit data paths), two VLSI PEs, two control stores (one for the PEs and the other for Linc chips), and two memory banks (see Figure 2). Each node has four I/O pairs (designated as North, South, East, and West, respectively) connected to its four nearest neighbors and two global buses (the X-Bus and Y-Bus) connected to the processor controllers (PCs). Through the global buses, each PC coordinates either a column or a row of the processor array. To separate the global communication activities from local computational events, the PEs and Linc chips are controlled independently. To configure the PEs with a large degree of freedom, in general, one may use off-the-shelf chips. In a computed tomography image-reconstruction system, each PE has a 32-word × 32-bit register file, a 32-bit multiplier, and a 32-bit ALU. The PEs are microcode-controlled and sequenced by the external PCs. When it operates at a 10-MHz clock rate, the computation node of Figure 2 reaches its peak performance of around 40 MFLOPS.

In the array system, neighborhood communication is provided by the mesh-connected network. When a computation task does not require interprocessor transfer, the Linc resources are allocated so as to fully support the pipelining of Linc’s local dataflows. Each Linc node is equipped with an independent set of sequential commands for setting up the Linc data paths. The entire array system can be configured to match a wide variety of computational structures, such as torus, systolic hexagon, systolic ring, binary tree, triangular array, and so on. Irregular computation graphs, such as those used for image rotation and interpolation, can also be easily embedded in the 2D array. 2 To offer PE permutations that are more dynamic, such as those provided in an interconnection network, the array system simulates a variety of interconnection networks by executing the Linc “instruction set.” For example, a cube interconnection network for N = 32 nodes can be “folded” into a 4 × 4 Linc array; it takes $2(N/2)^{1/2} - 2$ Linc cycles to simulate the five-stage
cube interconnection. A single-stage,
perfect shuffle-exchange network can be 
simulated in %(N/2)^ cycles. Since only 
four I/O pairs are involved, a total of 24 
(that is, 4!) Linc control patterns is
required. 

One often encounters the difficult prob¬ 
lem of algorithm decomposition when one 
is mapping systolic computations into a 
fixed-processor array. A natural way to 
overcome this problem, especially in 
image/signal processing, is to obtain par¬ 
allelism through the partitioning of input 
data. For example, a smooth systolic 2D 
convolution algorithm requires that the 
convolution kernel size be equal to the 
number of processors. By partitioning the 
entire image into subimages and storing 
each subimage in the local memory of a 
separate PE, one can derive a semisystolic 
dataflow that is not hampered by heavy 
cross-boundary data transfers. In Figure 
3, we show that convolution kernel K is 
broken into four parts ($K'$, $K_x$, $K_y$, and $K_z$) when it crosses a subimage boundary. In fact, kernel K can be viewed as wraparound subkernels within the subimage area of each PE. The convolution of each subkernel is carried out separately, with the partial sums denoted as $S'$, $S_x$, $S_y$, and $S_z$. After all the subkernels are computed, $S_x$, $S_y$, and $S_z$ are forwarded to their respective neighbors to be summed up as a total pixel value.

Since the architecture is reconfigurable, 
it is a nontrivial task to develop a user- 
friendly programming environment. 
However, because of the modular nature 
of the system architecture and the ease of 
controlling Linc datapaths, we foresee that
software development will be pursued in 
an empirical manner. The effort will be 


focused first on the writing of application- 
specific libraries, then on the development 
of a high-level language compiler, and 
finally will be targeted at a fault-tolerant 
operating system. 
The Concept and Implementation of
Data-Driven Processor Arrays 
Various topologies and architectural
designs for processor arrays
have recently been proposed. 
These designs include systolic arrays and 
globally asynchronous wavefront arrays. 1 
Both array types have a computation front 
that propagates according to a predeter¬ 
mined control sequence, and consequently 
these control-driven arrays have proven to 
be very effective for executing highly regu¬ 
lar algorithms like vector and matrix oper¬ 
ations. There are, however, many 
computationally demanding problems 
that do not exhibit high regularity and may 
therefore prove unsuitable for these 
control-driven arrays. Still, many of these 
problems have an inherent parallelism, 
and it should be possible to exploit this 
parallelism by means of processor arrays 
that can provide a high degree of 
pipelining. 


In Koren and Silberman, 2 a new 
approach to array design was proposed— 
that of developing specialized array 
architectures that would be capable of 
executing any given algorithm. In this 
approach, the algorithm is first repre¬ 
sented in the form of a dataflow graph 
(DFG) and is then mapped onto the array. 
The processing elements (PEs) in the array 
execute the operations included in the cor¬ 
responding nodes (or subsets of nodes) of 
the DFG; regular interconnections of these 
PEs serve as edges of the graph. 

In general, when an arbitrary algorithm 
is executed on an array there is no regular 
propagation of computation fronts. 
Hence, to speed up the execution of arbi¬ 
trary algorithms, a more flexible array is 
needed. Such an array should make pos¬ 
sible the generation of new computation 
fronts and their cancellation at a later time 


(the time depends on the arriving data 
operands). We therefore call these arrays 
data-driven arrays. The cell (that is, the 
PE) in these arrays should be capable of 
testing for the presence of its operands and 
executing only the instructions for which 
all the necessary operands have arrived. 
Thus, the order in which instructions are 
executed is data dependent, and the cell is 
truly a data-driven PE. 
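
The firing rule described above can be sketched in a few lines. The class, its method names, and the instruction layout (destination, operation, two source operands) are hypothetical and chosen only to illustrate the principle that an instruction executes as soon as all of its operands have arrived, in whatever order they arrive.

```python
import operator


class DataDrivenCell:
    """Toy cell that fires each instruction once both of its operands are present."""

    def __init__(self, program):
        # Each instruction is (destination, operation, operand_a, operand_b).
        self.pending = list(program)
        self.values = {}

    def receive(self, name, value):
        """Store an arriving operand and return the destinations that fired."""
        self.values[name] = value
        fired = []
        progress = True
        while progress:
            progress = False
            for instr in list(self.pending):
                dest, op, a, b = instr
                if a in self.values and b in self.values:
                    self.values[dest] = op(self.values[a], self.values[b])
                    self.pending.remove(instr)
                    fired.append(dest)
                    progress = True
        return fired


cell = DataDrivenCell([
    ("sum", operator.add, "x", "y"),
    ("prod", operator.mul, "sum", "z"),
])
cell.receive("z", 4)          # nothing can fire yet
cell.receive("x", 1)          # still waiting for y
fired = cell.receive("y", 2)  # sum fires, which in turn enables prod
```

The order of execution is decided by the arrival of data rather than by a program counter, which is the distinction the text draws between data-driven and control-driven cells.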

Processor-array architecture and principles of its operation

The feasibility of designing control- 
driven arrays was never in question; 
several types of PEs for these arrays have 
already been designed (see Fisher et al. 3 ). 
However, the degree of hardware com¬ 
plexity required to add the data-driven 
property was not clear to us. Therefore, we 
made a preliminary design of an appropri¬ 
ate processing element. The result of this 
design is very encouraging: the total hardware
complexity of the cell—which is
presented next—is less than 9000 transis¬ 
tors in NMOS technology. This low com¬ 
plexity should make possible the 
fabrication of a VLSI chip containing 
about 50 to 100 cells. The first phase of the 
design has already been completed and is 
reported in Peled. 4 A group of graduate 
students in the Dept. of Electrical and
Computer Engineering at the University of 
Massachusetts in Amherst is now finaliz¬ 
ing the detailed design and layout of the 
VLSI chip. 

Figure 1. The floorplan of the processor-array chip.

The proposed floorplan of the chip is
shown in Figure 1. The floorplan contains
data-driven cells arranged in rows in such
a way that the typical cell has six immedi¬ 
ate neighbors in a hexagonally connected 
processor array. The cells communicate 
with an external host computer through 
buses, as shown in Figure 1. In this host, 
the programs of the individual cells are 
prepared and are then distributed to the 
cells. The host also supplies the input oper¬ 
ands and accumulates the final results. 
Unlike host-to-array communication in 
control-driven arrays, restricting the host- 
to-array communication to passing only 
through boundary cells can slow down the 
array’s operation substantially. Conse¬ 
quently, we have decided to allow each cell 
to communicate directly with the host. 

The functional blocks that make up the 
basic cell are depicted in Figure 2. There 
are six registers, R1 through R6, which are
connected to a common internal bus; each 
of them is also directly connected to its cor¬ 
responding register in one of the six neigh¬ 
boring cells (see Figure 2). The latter 
connection is under hardware control, 
since its timing is crucial. In our design, 
this data transfer takes a single clock and 
is done in parallel with all other operations 
in the cell. 

The instruction memory contains six 
instructions that specify the cell’s operations,
one instruction per register (R1
through R6). Having six
instructions per cell increases the level of 
utilization of the cell and leads to a lower 
overall execution time. 

The flag array is a uniquely designed 
block that makes possible the data-driven 
operation of the cell. The instructions in 
the cell are not executed in any predeter¬ 
mined order. Instead, the arrival of all 
operands for a certain instruction enables 
the cell to execute that instruction. The
flags monitor the movement of operands 
both within the array and in and out of the 
cell. For each register there is a flag indicat¬ 
ing whether the register has an operand or 
whether it is empty and can receive a new 
operand. Only a single cycle is needed to 
test these flags to determine whether an 
instruction is ready to be executed. 
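The firing rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the cell's actual logic: the class names `Cell` and `Instruction`, the `receive` and `step` methods, and the rule that a result needs a free destination register are all hypothetical. Only the six registers, the one-flag-per-register scheme, and the idea of executing whichever instruction has all of its operands come from the text.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple

NUM_REGISTERS = 6  # R1 through R6, as in the article


@dataclass
class Instruction:
    sources: Tuple[int, ...]   # register indices (0-5) holding the operands
    dest: int                  # register index that receives the result
    op: Callable[..., int]     # operation applied to the operand values


@dataclass
class Cell:
    program: List[Instruction]
    regs: List[Optional[int]] = field(
        default_factory=lambda: [None] * NUM_REGISTERS)
    # One flag per register: True means "holds an operand", False means "empty".
    full: List[bool] = field(
        default_factory=lambda: [False] * NUM_REGISTERS)

    def receive(self, reg: int, value: int) -> bool:
        """Accept an operand only if the register's flag says it is empty."""
        if self.full[reg]:
            return False
        self.regs[reg], self.full[reg] = value, True
        return True

    def ready(self, ins: Instruction) -> bool:
        """Flag test: every source is full and the destination can be written."""
        sources_full = all(self.full[s] for s in ins.sources)
        # Assumption: the result needs an empty register (or one it consumes).
        dest_free = not self.full[ins.dest] or ins.dest in ins.sources
        return sources_full and dest_free

    def step(self) -> Optional[int]:
        """Fire the first ready instruction; return its index, or None if stalled."""
        for i, ins in enumerate(self.program):
            if self.ready(ins):
                result = ins.op(*(self.regs[s] for s in ins.sources))
                for s in ins.sources:              # operands are consumed
                    self.regs[s], self.full[s] = None, False
                self.regs[ins.dest], self.full[ins.dest] = result, True
                return i
        return None
```

In this sketch the order of execution depends only on which operands have arrived: a cell whose second instruction receives its operands first will fire that instruction first, regardless of its position in the instruction memory.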

The other functional units in the cell are 
self-explanatory. 

In parallel with designing the array, a 
procedure for mapping DFGs onto data- 


driven arrays has been developed and pro¬ 
grammed by Mendelson and Silberman. 5 
In this procedure, the user’s program (in 
VAL) is translated into a DFG and then 
mapped onto a finite array of PEs. The 
procedure allows us to take advantage of 
the data-driven cell’s capability of per¬ 
forming up to six operations. A node in the 
original DFG includes only a single oper¬ 
ation. Therefore, we may combine up to 
six simple neighboring nodes and map 
them onto a single PE. Finally, the array 
is partitioned into several subarray chips 
according to the technology-imposed limi¬ 
tations on the number of PEs per chip. 
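The node-combining step lends itself to a small illustration. The text does not describe the actual procedure of Mendelson and Silberman, so the sketch below substitutes a plain greedy breadth-first grouping; the function name `group_nodes` and the dictionary representation of the DFG are assumptions, and placement of the groups on the hexagonal array and partitioning into chips are not modeled.

```python
from collections import deque
from typing import Dict, List, Set

MAX_OPS_PER_PE = 6  # a cell holds six instructions


def group_nodes(edges: Dict[str, List[str]]) -> List[List[str]]:
    """Greedily pack neighboring DFG nodes into groups of at most six."""
    # Build an undirected neighbor map so "neighboring" ignores edge direction.
    neighbors: Dict[str, Set[str]] = {}
    for src, dsts in edges.items():
        neighbors.setdefault(src, set())
        for dst in dsts:
            neighbors[src].add(dst)
            neighbors.setdefault(dst, set()).add(src)

    assigned: Set[str] = set()
    groups: List[List[str]] = []
    for start in neighbors:                 # dict insertion order
        if start in assigned:
            continue
        group: List[str] = []
        queue = deque([start])
        # Grow the group outward from the start node until it is full.
        while queue and len(group) < MAX_OPS_PER_PE:
            node = queue.popleft()
            if node in assigned:
                continue
            assigned.add(node)
            group.append(node)
            queue.extend(sorted(n for n in neighbors[node]
                                if n not in assigned))
        groups.append(group)                # one group per PE
    return groups
```

Nodes left in the queue when a group fills up are not lost; they remain unassigned and seed or join a later group. A real mapper would also have to weigh communication between groups, which this sketch ignores.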

In summary, the idea of directly map¬ 
ping an arbitrary algorithm onto a VLSI 
array has been shown to be feasible. Fur¬ 
ther research is now being carried out to 
prove the effectiveness and practicality of 
this approach. 
Build a model of the application world before you begin designing your
database-application system 


The design of database-application 
systems is such a difficult task that 
professors write books about it, 
researchers attend conferences on it, 
and practitioners struggle with it. Natu¬ 
rally, the field is full of controversy. 
Should you start right away with actual 
system design, or should you begin and 
just keep on with prototyping of the 
system until you get it (almost) right? If 
you begin with design, should you first 
determine what the system is supposed 
to do (the function-first approach), or 
should you start with what it should 
remember (the data-first approach)? 
And when you design the data struc¬ 
tures, should you start with the fields 
and normalize them into records, or 
should you start with the records and 
incorporate the fields? Or should you 
just provide a good, fourth-generation 
application-development environment 
and let the users worry about it all? 

I have recently given this whole area 
considerable thought and have come up 
with a simplified approach to building 
database-application systems that 
bypasses most of these issues. My 
approach is suitable for use with fourth- 
generation application-development 
tools, and it is especially suitable for 
building prototypes for use in establish¬ 
ing system requirements. It is a simpli¬ 
fied approach because it defers all 
consideration of what the system is sup¬ 
posed to do, and what the system is 
supposed to remember, until a model of 
the system’s application world is devel¬ 
oped. This model describes the subject 
matter—which is what the system is all 


about. Such a model can be imple¬ 
mented quite directly to serve as a 
generic system prototype, that is, a pro¬ 
totype that is applicable to any system 
dealing with the same subjects. 

This simplified, real-world-modeling 
approach to prototyping and designing 
of database-application systems 
employs two primitive concepts, those 
of 

• objects, which exist over time and 
which endure changes, and 

• events, which happen at specific 
times and which cause changes. 

Objects exist in the real world and 
have life histories consisting of the 
events in which the objects participated. 
The term “objects” does not necessar¬ 
ily imply individual, physical existence. 
Objects can be activities, places, organi¬ 
zations, systems, or collections of 
things—anything with a history that 
users are interested in tracking. 

Events always involve objects because 
they cause objects to change. Events 
can be viewed as occurring instantane¬ 
ously (that is, at particular moments of 
time). They don’t have life histories, 
and once they have occurred, they don’t 
undergo change (except, perhaps, to be 
corrected). 

Although the world certainly doesn’t 
come to us already sliced up into 
objects and events, human language is 
the tool that does this slicing, so the 
concepts of “objects” and “events” are 
probably more familiar to most users 
than the alternative concepts of “rela¬ 
tionships” and “states,” on which most 


other approaches to system design are 
based. 

The model of the application world 
could be constructed in one of two 
ways. It could be built as a catalog of 
all the objects and events of interest. Or 
it could be built as a working prototype 
in which objects are represented as 
object records containing only identify¬ 
ing information and events are repre¬ 
sented as event records containing 
descriptive information and references 
to the objects involved. For example, an 
object record for a customer might con¬ 
tain a name, an address, and a tele¬ 
phone number. An event record for a 
purchase might contain a customer 
number, a product number, the quan¬ 
tity of the item(s) purchased, the total 
price, and a date/time stamp. 

A prototype system would provide 
facilities for registering objects, record¬ 
ing events, and processing queries. It 
would also allow users to identify 
objects in a variety of ways—for exam¬ 
ple, by serial number (customer 
123456), by specified identifying 
properties (customer Smith), or by 
specified relationships (the customer 
who owes the most)—by means of arbi¬ 
trary queries. Record-updating capabili¬ 
ties would be needed only for making 
corrections to the stored data. 
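A prototype along these lines might be sketched as follows. Everything named here (`WorldModel`, `Customer`, `Purchase`, the method names, and the use of integer cents) is a hypothetical illustration rather than part of the approach itself, and the customer record is trimmed to a number and a name for brevity. The "biggest spender" query stands in for the relationship-based queries mentioned above.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Customer:
    """Object record: identifying information only."""
    number: int
    name: str


@dataclass(frozen=True)
class Purchase:
    """Event record: descriptive data plus references to the objects involved."""
    customer_number: int
    product_number: int
    quantity: int
    total_cents: int
    when: datetime


class WorldModel:
    def __init__(self) -> None:
        self.customers: Dict[int, Customer] = {}
        self.purchases: List[Purchase] = []   # append-only event history

    def register(self, customer: Customer) -> None:
        self.customers[customer.number] = customer

    def record(self, event: Purchase) -> None:
        if event.customer_number not in self.customers:
            raise KeyError(f"unregistered customer {event.customer_number}")
        self.purchases.append(event)

    # Identification by serial number, by property, and by relationship.
    def by_number(self, number: int) -> Customer:
        return self.customers[number]

    def by_name(self, name: str) -> List[Customer]:
        return [c for c in self.customers.values() if c.name == name]

    def spent(self, number: int, since: Optional[datetime] = None) -> int:
        return sum(p.total_cents for p in self.purchases
                   if p.customer_number == number
                   and (since is None or p.when >= since))

    def biggest_spender(self) -> Customer:
        # A relationship computed by scanning the whole event history.
        return max(self.customers.values(),
                   key=lambda c: self.spent(c.number))
```

Note that nothing is ever updated in place: every answer is derived from the registered objects and the recorded events, which is what makes the model general and also what makes it potentially slow.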

This world model, or prototype, 
would define what the system is all 
about, and it would be capable of 
answering all reasonable queries involv¬ 
ing the recorded objects and events. 
Because of this generality, there would 
probably be some striking inefficiencies 
in processing and storage. These would 
be resolvable only by considering what 
the system is supposed to do and what it 
therefore needs to remember. 

System functions can always be 
expressed in terms of relationships 
among the system’s objects and events. 
Unfortunately, computing some rela¬ 
tionships can, in principle, require 
processing the system’s entire past history
of events. So, to avoid repeating
such heavy processing for every query 
that demands it, it is often necessary to 
store certain relationships and to update 
these stored relationships as events are 
reported to the system. Such stored 
relationships are expedients, however, 
and only serve to speed processing. 

In many cases, it is simply not practi¬ 
cal to keep on file a complete temporal 
and historical record of all past events, 
regardless of the obvious advantages of 
doing so. Instead, events must be summarized
and stored as object states, and
the supporting event records must either 
be purged or relegated to archival stor¬ 
age. These object states, which summa¬ 
rize the object’s past history of events, 
must also be updated as events are 
reported to the system. Such states are 
also expedients, however, and serve 
mainly to reduce storage requirements— 
though they may also speed processing. 
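The effect of summarizing can be made concrete with a short sketch. It assumes a hypothetical store, `SummarizingStore`, that keeps one object state per customer (the balance owed, in cents) and either retains or discards the supporting event records; the event kinds "purchase" and "payment" are likewise invented for the illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class CustomerState:
    balance_owed: int = 0   # cents; a running summary of past events


class SummarizingStore:
    def __init__(self, keep_events: bool) -> None:
        self.keep_events = keep_events
        self.states: Dict[int, CustomerState] = {}
        self.events: List[Tuple[str, int, int]] = []  # (kind, customer, cents)

    def report(self, kind: str, customer: int, cents: int) -> None:
        """Update the object state as each event is reported."""
        if kind not in ("purchase", "payment"):
            raise ValueError(f"unknown event kind: {kind}")
        state = self.states.setdefault(customer, CustomerState())
        if kind == "purchase":
            state.balance_owed += cents
        else:
            state.balance_owed -= cents
        if self.keep_events:
            self.events.append((kind, customer, cents))

    def owes(self, customer: int) -> int:
        """Answered from the stored state without scanning any history."""
        return self.states[customer].balance_owed

    def total_business(self, customer: int) -> Optional[int]:
        """Needs the event history; returns None once it has been discarded."""
        if not self.keep_events:
            return None
        return sum(cents for kind, who, cents in self.events
                   if kind == "purchase" and who == customer)
```

With `keep_events=False` the store uses constant space per customer and answers `owes` immediately, but a customer who bought 9000 cents of goods and paid it all back is indistinguishable from one who never bought anything.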

The benefits of stored relationships 
and object states are clear: faster 
processing and reduced storage. The 


costs are equally clear (though too often 
they are ignored): added complexity, 
reduced fidelity, and limited func¬ 
tionality. 

The added complexity arises both 
from the need to process incoming 
event reports as database-update trans¬ 
actions when a single event typically 
requires the updating of multiple rela¬ 
tionships and states, and from the diffi¬ 
culty of undoing and redoing these 
updates when an event report must later 
be corrected. 

Reduced fidelity, a condition in 
which the stored relationships and 
object states do not properly reflect the 
events that have transpired, is an almost 
inevitable consequence of having two 
different representations of the world 
(event records on the one hand and 
stored relationships and object states on 
the other). Discrepancies will creep in, 
for many all-too-likely reasons; such
reasons include transactions that fail to 
complete and events that are reported 
and/or processed out of order. 


Limitations on system functionality 
are a consequence of replacing event 
records with summary object states. 
Some information is necessarily lost, as 
illustrated by a frustrated user’s plea, 
“Don’t just tell me how much this cus¬ 
tomer currently owes, tell me how much 
business he’s done with us over the last 
few years.” 

The message for you system designers 
in all of this is that event data is sum¬ 
marized in variable-state object records 
and in stored relationships at your peril. 
Make sure your users fully understand 
the trade-offs involved in such sum¬ 
marizing, particularly the resulting limi¬ 
tations on system functionality. A good 
way to communicate these trade-offs is 
to build an object/event-based proto¬ 
type, introduce stored relationships and 
object states gradually, and let your 
users experience the trade-offs. 


Chris Shaw 
Trillium 


Confessions of a used-program salesman—excuses 


Did you ever reach a point of frustra¬ 
tion when you just wanted to scream? 
Well, the used-program business has 
had its ups and downs, and lately I’ve 
been in a slump. Since I opened my new 
Parts Department, I have been running 
into all kinds of problems convincing 
my old customers to take advantage of 
these reusable components. My cus¬ 
tomers always seem to find excuses * for 
buying a new program instead of invest¬ 
ing in some of my well-used or refur¬ 
bished parts. I swear that I’ve heard 
every excuse in the book; in fact, I’ve 
decided to write them down along with 
translations of what I think each cus¬ 
tomer is really saying. The following, 
then, are the most popular excuses for 
not reusing software. 

1. Only wimps use someone else’s 
software. 

Translation: If I were to reuse some¬ 
one else’s software, then I’d be admit¬ 
ting that I couldn’t write software 
myself. 

2. Reuse of software destroys the ability 
to create it. 

Translation: It’s more fun to do it 
myself. 


*I would like to thank Ed Berard of EVB Software
Engineering for sharing some of his favorite 
excuses with me. 


3. Introduction of reusable software 
will eliminate my job. 

Translation: As long as I am meas¬ 
ured by how many lines of code I write, 
why should I do something that reduces 
my perceived productivity? 

4. Reusable software cannot be 
efficient. 

Translation: Why should I pay for all 
the additional baggage that someone 
else puts in to check software for error 
conditions and to add extra parameters 
that I’ll never use? Besides, I know a 
better algorithm anyway. 

5. I don’t want to be the first.

Translation: Let someone else work 
out the bugs and pay the start-up cost 
to create the parts initially. 

6. Trying to reuse someone else’s soft¬ 
ware is a waste of my time. 


Translation: Why should I pay for 
someone else’s mistakes? The software 
probably has bugs in it, probably isn’t 
very well documented, and probably 
won’t work for my application. I’ll 
probably spend more time trying to fig¬ 
ure out what it does—whether or not it 
works—and how to modify it than I 
would writing it myself in the first 
place. 


7. I don’t believe that software reusability
is a viable concept.

Translation: I am too comfortable 
developing software the way I’ve been 
developing it for the last n years. 
Besides, I’ve already learned structured 
programming; isn’t that enough? 

There are many technical issues 
associated with making software reuse 
feasible. Those most often cited include 
determining what should be reused, 
how to design for reuse, how to design 
with reused software, and how to clas¬ 
sify, store, and retrieve software com¬ 
ponents for reuse. However, the bottom 
line, I have found, is that the most 
prevalent excuses for not reusing soft¬ 
ware are nontechnical; they are socio¬ 
logical, psychological, or administrative. 
What we are faced with is an inherent 
distrust of another person’s software. 
(What does that say about the general 
reputation of software quality?) 

I often wonder what it will take for 
us to learn that if we can’t do it right 
the first time, we can always do it over 
and over...and so the story continues. 


Sincerely, 

Will Tracz (your friendly 
used-program salesman) 
Computer Systems Laboratory 
Stanford University 
Editor: Helen M. Wood, National Bureau of Standards, B154 Technology, Gaithersburg, MD 20899; (301) 975-3240. 


Standards have life cycles, too 

Laurel V. Kaleda, Chair, CS-IEEE Standards Coordinating Committee

The Computer Society of the IEEE 
(CS-IEEE) sponsors development of 
standards in the many varied areas of 
computer technology. As a member 
society of the IEEE, the CS-IEEE uses 
the IEEE Standards Board’s policies 
and procedures to guide the develop¬ 
ment of these standards. 

All standards developed within the 
IEEE have a “life cycle” that is similar 
to the product life cycle referenced quite 
often for computer products. I define 
the life cycle for IEEE standards as 
“the period of time that starts when a 
standard is conceived and ends when 
the standard is no longer available for 
use.” (This definition is based on the 
software life cycle definition in 
ANSI/IEEE 729: Glossary of Software 
Engineering Terminology, IEEE, New 
York, 1983.) The life cycle includes a 
requirements phase, design phase, 
implementation phase, test phase, oper¬ 
ation and maintenance phase, and a 
retirement phase. To provide a better 
understanding of what the development 
process for standards involves and how 
it occurs, this article discusses the pur¬ 
pose, importance, and the participants 
in each of the life cycle phases of 
standards. 

Standards requirements phase— 
creating the PAR. The sponsor (usually 
a technical committee [TC] or TC stan¬ 
dards subcommittee) proposes an idea 
for a standard and determines whether 
there is enough interest in the idea to 
attract the participation of
individuals to form a working group. 
Alternatively, a group of individuals 


with a specific idea for a standard finds 
an appropriate TC to sponsor the work. 
The working group, with the assistance 
of the sponsor, defines the scope and 
purpose of their efforts and documents 
these in a Project Authorization 
Request (PAR). The PAR is reviewed 
by the sponsor and the IEEE Standards 
Board, who check that the proposal is 
within the scope and purpose of the 
IEEE, that the project is assigned to the 
appropriate member society (e.g., the 
Computer Society), that appropriate 
coordination with other standards- 
making organizations is provided for, 
and that the proposed effort does not 
duplicate or conflict with other existing 
standardization efforts. 

Standards design phase—structuring 
the standard. The working group 
develops the basic structure of the 
standard that they wish to develop. The 
issues of who will be the users of the 
standard, how much information 
should be in the standard, and what 
format should be used to present that 
information are all addressed by the 
group. The outline for the document to 
be developed is created and approved. 
The working group’s general operating 
procedures are also developed at this 
time, if necessary. These are set up to 
enhance the efforts of the group, not to 
make unnecessary restrictions on mem¬ 
bership or participation. 

Standards implementation phase- 
writing the standard. Based on the 
structure already determined, the work¬ 
ing group adds the words and para¬ 
graphs that will fulfill the originally 
agreed upon scope and purpose of the 
standard. The working group eventually 
reaches a consensus that they have com¬ 


pleted the writing work. This phase has 
sometimes been called the work, work, 
and rework phase. Consensus among 
very opinionated professionals is not 
very easy to achieve. 

Standards testing phase—balloting 
the standard. The completed work is 
now balloted by the sponsor. This is the 
true test of the work, with the evalua¬ 
tion performed by all interested parties. 
At the completion of the 30-day ballot¬ 
ing period, all comments are analyzed 
and resolved, and all changes are resubmitted
to the balloting group so that members
can reconsider their ballot positions.
If the final ballot result is more than 75 
percent positive, with more than 75 per¬ 
cent of the ballots returned, the ballot is 
successful, and it is passed on to the 
IEEE Standards Board. This board 
validates that the balloting process has 
been carefully and correctly completed. 
With Standards Board approval, the 
working group’s document becomes an 
IEEE Standard. 
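The two thresholds can be written down as a simple check. The function name `ballot_passes` is hypothetical, and one point is an assumption: the text does not say what the 75 percent positive figure is measured against, so the sketch takes it as a share of the ballots returned and ignores abstentions.

```python
def ballot_passes(sent: int, returned: int, affirmative: int) -> bool:
    """Return True if a ballot meets both 75 percent thresholds.

    Assumption: the affirmative share is measured against returned ballots.
    """
    if not 0 <= affirmative <= returned <= sent:
        raise ValueError("need 0 <= affirmative <= returned <= sent")
    if sent == 0:
        return False
    # "More than 75 percent", compared in integers to avoid rounding error.
    enough_returned = 4 * returned > 3 * sent
    enough_positive = 4 * affirmative > 3 * returned
    return enough_returned and enough_positive
```

Both comparisons are strict, so a ballot with exactly 75 percent returned, or exactly 75 percent positive, does not pass under this reading.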


Standards operations and maintenance
phase—using the standard. The
approved standard is now published
and made available for use. Comments 
are collected, and all requests for 
interpretation are formally answered by 
the IEEE Standards Board. Every five 
years, or earlier as deemed necessary by 
the comments received by the IEEE 
Standards Board about the standard or 
by changing technology, the standard 
must be reevaluated and either reaffirmed 
as is, revised and reballoted, or with¬ 
drawn. The sponsor is responsible for 
ensuring that the reassessment is done 
on a timely basis. 

Standards retirement phase— 
withdrawing the standard. If the group 
reevaluating the standard determines 
that it is obsolete, for whatever reasons, 
the standard is withdrawn from public 
availability. No further copies are 
printed or sold, and a public notice is 
published regarding the withdrawal. 

The standards life cycle is similar to 
that of almost any other product cycle. 


Why is there so much interest in standards? What are the issues— 
economic, technical, political—involved? Who are the players? These are 
among the topics that will be considered in the Standards Department. 

Short articles on standards-related topics are sought for publication in this 
department. For details regarding scope and length, contact the department 
editor at the address given above. 
A standard begins as a requirement and 
evolves into a document that reflects a 
consensus—a substantial agreement, 
but not necessarily a unanimous one. It 
must continue and it must be main¬ 
tained to reflect that consensus until it 
is no longer applicable. 

You, the readers of this column, can 
choose to become involved with the 
life cycles of the Computer Society’s 
standards work. The process that the 


standards development life cycle phases 
use is an open process, one in which 
you can be involved based on your 
professional interest and experience. 
With every standards activity that is 
announced there is a contact name or 
names given—go ahead and place a 
phone call, or send a note! You may 
find a new way of enhancing your 
professional vitality—through partici¬ 
pation in standards development. 


DOD/industry fiber optics standards symposium 


The Department of Defense, Ameri¬ 
can National Standards Institute, and 
the Electronic Industries Association 
are jointly sponsoring a three-day sym¬ 
posium devoted exclusively to fiber 
optic systems and components stan¬ 
dardization. It will take place at the 
Sheraton National Hotel in Arlington, 
Virginia on December 7-9, 1987. 

The symposium is intended for those 
in government and industry who are 
developing fiber optic standards; system 
and component designers and users; 
and program managers responsible for 
development, administration, and 
installation of these systems. 

The program includes six sessions 
running serially over the three-day 
period. Each will be chaired jointly by 
government, military, and industry 
leaders. Discussion topics include 

• standards for digital and analog fiber 
optic systems and subsystems; 

• active components, laser and LED 


sources, and PIN and APD detector 
parameters; 

• passive components, tolerances, and 
environmental constraints; 

• fiber optic systems, subsystems, and 
component test methods; 

• national and international quality 
assurance programs; and 

• DoD/commercial standards assess¬ 
ment and evaluation. 

Leading industry and government 
keynote speakers will discuss recent 
technological developments and market 
directions at the opening session and 
during the three planned luncheons. 

Information regarding registration 
for the symposium or concerning the 
presentation of papers should be 
referred to the EIA coordinator, Hal
Berge, 2001 Eye Street, N.W., Wash¬ 
ington, DC 20006; or call (202) 
457-8737. The registration fee of $225 
covers all meeting costs, including 
luncheons and proceedings. 


Project on unrecorded reversible optical digital disk 


Increasing demand for the storage of 
data in machine-readable form has 
prompted equipment manufacturers to 
pursue new technologies offering the 
promise of greatly increased storage 
density at a comparatively reduced cost. 
One of these new technologies is the 
optically recorded, reversible, digital 
disk. The nominal 130-mm disk size is 
regarded as large enough to store an 
adequate amount of information, and 
yet small enough to put into a small 
office, personal computer environment. 

Accredited Standards Committee X3 
on Information Processing Systems has 
approved a new project to develop an 
American National Standard for 
Unrecorded Reversible Optical Media 
Unit for Digital Information Inter¬ 
change Nominal 130 mm (5.25 inches) 


diameter. This standard will benefit 
users requiring digital data storage of 
small-office files, word-processing and 
text manipulation, and small-business 
and personal computer data files. 

The standard will allow this remova¬ 
ble media unit to physically be written 
and read on any equipment that adheres 
to the minimum requirements. The 
actual development work will be carried 
out by Technical Committee X3B11, 
Optical Digital Data Disks. X3B11’s
target date for completion of this standard
is January 1989. For
more information, contact the X3B11 
Chair, Joseph S. Zajaczkowski, Chero¬ 
kee Data Systems, 1880 South Flatirons 
Court, Suite H, Boulder, CO 80301; 
(303) 449-1239.
Computer groups oppose tax law changes they say hurt consultants

Tom Szalkiewicz, Assistant Editor


The IEEE and the Independent Computer
Consultant Association have
joined hands to develop strategies to
repeal or amend Section 1706 of the
Tax Reform Act of 1986.

This section eliminates the safe harbor
provisions of the 1978 tax act that
ensured that independent computer
consultants would not be classified by
the Internal Revenue Service as
employees of the companies these consultants
contracted with. The provisions
entitled consultants to the same tax
breaks that self-employed individuals
enjoy and exempted companies from
withholding payroll taxes from their
payments to the consultants.

However, Provision 1706 of the new
law subjects these consultants to the
common law tests that prevailed before
the safe harbor provisions. These common
law tests are subject to differing
interpretations, resulting in disputes
among consultants, clients, and the
IRS.

“The confusion over differing IRS
rulings interpreting common law
requirements was killing the computer
consulting industry,” commented
Jeffrey I. Sachs, chairman of the
ICCA.

Carleton A. Bayless, IEEE vice president
for professional activities and
chairman of the IEEE’s US Activities
Board, has an even stronger argument
against 1706: “The short term result of
this [new] legislation has been to create
confusion and uncertainty about the
employment status and tax liability of
thousands of engineers and computer
specialists. This confusion and uncertainty
is having a chilling effect on
innovation and productivity throughout
the nation’s research and development
community.”

Joe Gale, a staff member in the office
of Senator Patrick Moynihan (D-NY),
who introduced Provision 1706, argued
that this new provision is far more
equitable than the old provisions: “The
problem with the safe harbor provisions
was that they were not being applied
consistently and were not available
across the board, whereas the common
law standards are.”

Gale added that some companies and
individuals may have used the safe harbor
provisions to avoid their tax responsibilities.
He added that legislation will
soon be introduced that would ensure
that the IRS apply common law standards
fairly and properly.

This approach notwithstanding, both
the IEEE and ICCA will press ahead
with their program to support bills in
the House and Senate to provide a two-year
delay in the effective date of Section
1706. This will give Congress time
to hold hearings on this tax reform
section—“a measure adopted without
due consideration of its consequences
for independent computer consultants,”
according to ICCA spokesman
Barry Dunnegan.


Neural systems program created at Caltech

The AT&T Foundation has awarded
the California Institute of Technology a
three-year, $300,000 grant to help support
a new program in computation and
neural systems.

The principle behind neural systems
is that they can recognize and remember
by association the same way organic
systems do—and at incredible speeds.

According to AT&T vice president
William Clossey, the lightning speed of
neural systems depends on designing
computer chips called neural networks
that mimic the way some brain cells
retrieve stored information and solve
problems.

Caltech’s new program will combine
aspects of neurobiology, computation,
information theory, VLSI technology,
materials science, and studies of the
richness of complex systems.

One of the world’s leading experts in
this field is John Hopfield, a professor
of chemistry and biology at Caltech.


DoD Ada directive 
formalized 

On April 2, 1987, Deputy Secretary 
of Defense William H. Taft, IV signed 
DoD Directive 3405.1, making Ada the 
single, common computer programming 
language for certain DoD resources, 
including those in all branches of the 
military. 

This directive, “Computer Programming
Language Policy,” supersedes
DoD Instruction 5000.31 (see W.
Myers, “Ada: First Users—Pleased;
Prospective Users—Still Hesitant,” 
Computer, March 1987, pp. 68-73). Its 
main effect will be to limit the number 
of high-order programming languages 
used within the Department of Defense, 
thus furthering DoD’s plans to use Ada 
for software development. Further¬ 
more, Ada must be used in defense 
computer resources that are part of 
intelligence systems, for the command 
and control of military forces, or as an 
integral part of a weapon system. Ada
shall be used for all other applications 
except when the use of another 
approved higher order language is more 
cost-effective over the application’s life 
cycle, the directive states. 

When Ada is not used, only certain 
other standard higher-order program¬ 
ming languages shall be used to meet 
custom-developed procedural language 
programming requirements. These 
standard languages include C/Atlas, 
Cobol, CMS-2M, CMS-2Y, Fortran, 
Jovial (J73), Minimal Basic, Pascal, 
and SPL/1. 

Programming languages other than 
Ada that were authorized and being 
used in full-scale development before 
the directive was issued may continue to 
be used through deployment and for 
software maintenance, but not for 
major software upgrades. The DoD’s 
definition of a major software upgrade 
is the redesign or addition of more than 
one-third of the software. 

The text of this directive is available 
on the DDN Ada 20 host under the file 
name <ada-info>3405-1.hlp. It will
also be available on the PC bulletin
board under the file name 3405-1.hlp.
The phone number for the PC bulletin 
board is (202) 694-0215. 
Results of 1985-86 Taulbee Survey


Completed in 1986, this survey examines
the production and employment
of PhDs and faculty of 117 PhD-granting
computer science/engineering
departments in the United States and
Canada (107 US and 10 Canadian) during
the academic year 1985-86. Highlights
of this survey include:

• The 117 departments produced 412
PhDs—an increase of 20 percent over
the previous year. Of the 412, 191
went to academia, 145 to industry, 19
to government, and 35 overseas; three
were self-employed and four unemployed.

• The departments expect to produce
652 PhDs next year. This is far too
optimistic; 480 to 490, again an
increase of 20 percent, is more likely.

• PhD qualifying-exam passage was at
the same rate as last year: 7.3 students
per department.

• The 117 departments had 2040
faculty members: 780 assistant, 563
associate, and 707 full professors.

• The departments reported hiring 232
regular faculty and losing 174 (to
retirement, death, other universities,
and nonacademic positions).

• The 117 departments want to grow
from 2173 faculty members to 2973
by 1991-92, an increase of 37 percent
in five years, at an average rate of
about 1.4 per department per year.
(Last year, 103 departments expected
a growth of two faculty members per
department, but they grew only one
per department.)

According to the author of the survey,
David Gries of Cornell University,
“Last year’s growth of 20 percent in
PhD production, together with a similar
expected growth this year, gives some
optimism that the field can produce 600
PhDs per year within four years. This
would create a reasonable supply of
qualified faculty candidates, which has
been lacking in the past 10 years. In
fact, in three or four years one might
find many more new PhDs taking positions
in the lower-ranked PhD departments
and in non-PhD-granting
departments.”


Advances made in superconducting computer chips

Researchers at the IBM Thomas J. 
Watson Research Center in Yorktown 
Heights, New York, have made the first 
thin-film superconducting devices that 
operate at temperatures high enough to 
be practical and useful. 

Superconducting materials lose all 
resistance to electricity below a specific 
temperature. A class of copper-oxide 
materials developed last year by 
researchers in Switzerland exhibit super¬ 
conductivity at unprecedented high tem¬ 
peratures. 

New robotics/AI exhibit 

The Computer Museum in Boston 
has opened a new exhibit devoted to 
artificial intelligence and robotics. 
“Smart Machines” will include hands- 
on displays featuring robots; expert sys¬ 
tems that give directions, choose wines, 
or play chess; and “smart” machines 
used by artists to draw or make music. 

The 4000-square-foot exhibit was 
funded by Russell Noftsker, chairman 
of Symbolics, Inc., along with 14 other 
founders of symbolics, and C. Gordon 
Bell, director of the National Science 
Foundation’s computer research 
division. 

According to a museum spokesman, 
the exhibit’s robot theater will present 
the history of real and fictional robots, 
and visitors will be able to race real 
robots through an obstacle course or 
manipulate a robot arm to get prizes. 
Visitors can use the expert systems to 


“The Japanese should not produce 
more semiconductors than the market 
can bear and should avoid pressure to 
sell semiconductors at less than 
reasonable prices,” recommended John 
M. Richardson, chairman of the Com¬ 
mittee on Communications and Infor¬ 
mation Policy of the IEEE. Addressing 
both short- and long-term aspects of the 
problem, Richardson’s committee and 
the IEEE Committee on US Competi¬ 
tiveness recommended that Japan 
should agree to “quantitative levels” of 


Made from these copper oxides, the 
new devices are composed of two thin- 
lm Josephson junctions, are only one 
one-hundredth the thickness of human 
hair, and are superconducting at up to 6 
K ( —337°F). 

IBM claims these SQUIDs (supercon¬ 
ducting quantum interference devices) 
are the most sensitive magnetic detec¬ 
tors known to science. 


at Computer Museum 

find the best way home or to select a 
wine for their dinner. Computer¬ 
generated art and music will illustrate 
the growing use of smart machines in 

“Smart Machines” will be a perma¬ 
nent addition to The Computer 
Museum’s hands-on and historical com¬ 
puter exhibits. Underlining the impor¬ 
tance of this addition to the museum’s 
collection, Joseph F. Cahen, the 
museum’s executive director, said, 

“ ‘Smart Machines’ adds a major new 
dimension to the Computer Museum’s 
coverage of the computer world.” 

Located at 300 Congress St. in Bos¬ 
ton, the museum is open seven days a 
week, from 10 a.m. to 6 p.m. during 
July and August. It is closed Mondays 
from September through June. For 
information, call (617) 426-2800. 


access to its own markets by US semi¬ 
conductor producers. 

The committees also recommended 
long-term actions addressing the US 
side of the problem. To achieve interna¬ 
tional competitiveness, America must 
address such issues as capital forma¬ 
tion; product, process, and equipment 
development; improvement of manu¬ 
facturing efficiency; increasing competi¬ 
tive quality and reliability; and 
“imaginative management initiatives.” 


IEEE committees offer recommendations to end 
semiconductor crisis 
Editor: Sallie Sheppard, Dept. of Computer Science, Texas A&M University, College Station, TX 77843; (409) 845-5466


SIGDA/DATC scholarships presented 
at Design Automation Conference 


Design automation scholarships of 
$8000 each were awarded to the Univer¬ 
sity of Utah and to the Virginia Poly¬ 
technic Institute and State University on 
Monday, June 29, in Miami Beach, 
Florida. The awards, presented during 
the 24th ACM/IEEE Design Automa¬ 
tion Conference, also included a $7000 
renewal of Portland State University’s 
1986 scholarship. 

Six other applicants received $1000 
library support grants. They are the 
University of Cincinnati, North Texas 
State, North Carolina State, the Univer¬ 
sity of Colorado, Boulder, the Univer¬ 
sity of Texas, Dallas, and the University 
of Virginia. 

The scholarships, aimed at support¬ 
ing graduate research and study, are a 
joint project of ACM’s Special Interest 
Group on Design Automation and the 
Computer Society’s Technical Commit¬ 
tee on Design Automation. The $8000 
scholarships are funded jointly by 
ACM-SIGDA and CS-IEEE. The $7000 
renewal and $1000 library grants are 
funded by SIGDA. 

Selection process. The scholarships 
are awarded directly to a university to 
support one or more design automation 
graduate students as outlined in the 
university’s proposal. Nineteen 
proposals were received by the April 
deadline, according to Herschel H. 
Loomis, Jr., of the Naval Postgraduate 
School in Monterey, California, who 
chaired the selection activity. 

The selection committee was com¬ 
posed of Charles Radke of IBM and 
CS-IEEE, Charles Shaw of Intersil/GE 
and ACM-SIGDA, Paul Weil of Silvar- 
Lisco and a member of both sponsoring 
organizations, and Loomis. Each com¬ 
mittee member ranked all proposals 
independently. When compared for the 
final selection, the rankings showed a 
high degree of committee agreement, 
Loomis said. 

Renewal award. Portland State’s 
renewal award will be used for research 
on a logic design machine composed of 
an IBM PC/AT with many coprocessor 
and interface boards to the Multibus, 


each containing custom VLSI chips 
designed at the university. It will also 
support work on DIADES, a design 
automation system. 

The faculty sponsor for the projects 
at the university’s Portland Center for 
Advanced Technology is Marek A.
Perkowski, an associate professor of
electrical engineering. The students are
Prahlad Gokul, Bogdan Gaikowski, 
and David Smith, who was a 1986 
recipient. 

According to Loomis, the 1986 award 
increased local industry’s participation 
and equipment donations and expanded 
the university’s set of computer-aided 
design courses. Results were reported at 
an ACM/IEEE workshop held in Santa 
Barbara in a paper titled “Beyond 
Behavioral Synthesis.” 

1987 awards. The University of 
Utah’s scholarship will help fund 
research on an intelligent silicon com¬ 
piler for generating fast VLSI architec¬ 
tures. Single-chip computers will be 
generated by a silicon compiler system 
optimized for selecting the highest 
speed architecture to solve a particular 
problem. The system will include a 
knowledge-based design expert for 
guiding the choice of the architecture 
and modules to generate, configure, 
compact, and simulate the design 
chosen. 

The faculty sponsor is Kent F. Smith,
a professor of computer science. Jun 
Gu is the student involved in the work. 

The Virginia scholarship will be 
devoted to research on computer-aided 
test generation of CMOS circuits and 
on distributed design automation 
algorithms. The test research focuses on 
switch-level test generation to address 
the inadequacies of traditional gate- 
level circuit models and stuck-at faults. 
The large amount of computation 
expected will run on a network of work¬ 
stations that, when not being used inter¬ 
actively, will be used as a distributed 
system for test generation. 

Faculty sponsor Scott F. Midkiff will 
work with student researcher Stuart W. 
Bollinger on Virginia’s award-winning 
proposal. 


Technical Committee 
on Computer Packaging 

John W. Balde 

Vice Chairman, Liaison 

The Computer Packaging Technical 
Committee is one of the Computer 
Society’s TCs that deals only with the 
hardware and the resultant electrical 
performance of electronic systems. It 
has a minimal interest in architecture 
and software, and its interest in semi¬ 
conductors, resistors, capacitors, and 
other components is only in how to use 
them. 

At the time of its formation, this TC 
was the only organization concerned 
with computer packaging. But its mem¬ 
bers have helped other organizations 
address the needs of system packaging 
engineers so that packaging issues are 
now addressed by the National
Electronics Packaging Conference
(NEPCON), the Electronic Components
Conference (ECC), the International 
Society for Hybrid Microelectronics
(ISHM), the Institute for Interconnect¬ 
ing and Packaging Electronic Circuits 
(IPC), the International Electronic 
Packaging Society (IEPS), International 
Electronic Manufacturing Technologies 
(IEMT), and the IEEE Components, 
Hybrids, and Manufacturing Technol¬ 
ogy (CHMT) Society. Computer Pack¬ 
aging TC members are active in all 
those societies and organizations, and 
help them run their conferences, semi¬ 
nars, and workshops. 

Workshops and task forces. Because 
of all these other activities, the Com¬ 
puter Packaging TC confines itself to 
sponsoring regional bimonthly meet¬ 
ings, which alternate between the East 
and West Coasts, and to organizing 
invitation-only workshops. Since the 
late 1970’s, Spring Workshops have 
been held on the East Coast in even- 
numbered years and on the West Coast 
in odd-numbered years—for example, 
the spring 1986 meeting in Split Rock, 
Pennsylvania, and the recent May 1987 
meeting in Solvang, California. A 
newly organized Japan workshop was 
held at Oiso, near Tokyo, in January 
1987. The first European workshop, 
jointly sponsored with IEPS, will be 
held in Brussels this September.

When it recognizes an industry need, 
the committee has from time to time 
organized ad hoc task forces to address 
technical problems that require joint 
effort to solve. Over the years it has 
brought into being the concept of 
standard-interchangeable chip carriers 
and under-carpet cable. It is currently 
involved with two operations: testing 
the lead compliance of leaded chip car¬ 
riers and evaluating the corrosion 
prevention performance of silicone gels. 
Both of these efforts involve over 25 
major electronics companies, and in the 
case of the Gel Task Force, the military 
and government testing establishments. 

The committee does not set up these 
activities as permanent working bodies. 
Instead, it turns them over to other 
organizations at the first opportunity. It 
sets no standards, holds no conferences, 
and publishes no transactions or news¬ 
letter. 

Presentations of task force results are 
made at whatever conference suits the 
timing of the report of task force 
activity. Typically, presentations have 
been made at IEPS conferences and at 
NEPCON, both of which have made 
special arrangements for presentation 
of late-breaking news. (IEEE’s ECC 
makes no such provision.) Archival 
publication of reports is in the IEEE 
Transactions on Components, Hybrids, 
and Manufacturing Technology, which 
provides publication support for the 
committee. 

Comparative evaluations. There has been
much to report in the field of packaging
over the past year. The Spring Workshops
have compared the major packaging
technologies by posing a common design
problem, first a 200,000-gate and then a
500,000-gate central processor, to be
implemented in a series of alternative
technologies. Designs were submitted in
pin grid array, leaded and leadless chip
carriers, the IBM TCM technology, and
chip on board. There was even a DIP
design.

Although costs in the first compari¬ 
son exercise in 1984 ranged from 
$17,000 to $75,000, the speeds were 
almost proportional, so the cost/speed 
criteria were essentially the same (within 
15 percent). This time, costs again 
showed the same range, but there was 
one big difference. All the ECL solu¬ 
tions were almost double the cost of the 
CMOS solutions, and the cost/speed
criteria showed over a 35 percent
difference in favor of CMOS. One has to
need the higher speed itself to justify
the greater cost, since the same
throughput is otherwise obtainable with
systems of lower cost.

New technologies and information.
Various new technologies were
introduced first at Computer Packaging TC
meetings. Some of these were foamed 
Teflon reinforced circuit boards (at 
Oiso), impingement air and impinge¬ 
ment water cooling (at Split Rock), 
hand-held computer terminals, and 
most of the new mainframe packaging 
concepts. 

The task force activities also led to 
new information. The investigations of 
the failure phenomena of leaded chip 
carriers led to the discovery that vapor 
phase soldering can cause the solder to 
migrate up the leads away from the sur¬ 
face pad unless the board is preheated, 
and that the solder joints can be split 
apart even before the joints have solidi¬ 
fied unless the leads are differentially 
cooled to prevent relative movement 
during solidification. 

The results of the Compliant Lead 
Task Force, presented at the IEPS in 
November 1986, indicated that there 
were significant differences between the 
leads of commercial package types, and 
that joint failures can be significant for 
packages in severe applications. It was 
apparent that variations in assembly 
placement and soldering can also be a 
significant factor, and that such varia¬ 
bles must be better controlled to pro¬ 
duce clean evaluative results for 
package leads alone. Additional testing 
will then be performed, but the interim 
results are now being published. 


The Gel Task Force began its activity 
with a principal thrust to convince the 
military services to look at semiconduc¬ 
tor packaging other than hermetic 
ceramic. In just a few months, it 
created a climate of acceptance and 
hope that the right surface coatings will 
be found, with a major emphasis and 
change of attitude as the military looks 
for alternatives to leadless ceramic chip
carriers. 

A new thrust is an interest in the 
availability of low dielectric constant 
circuit boards, and a continuing interest 
is in the methods of heat removal from 
packages. The main concern, however, 
remains the trade-off analysis of the 
alternative packaging choices. 

Chairs and membership. The current 
TC chair is Eugene Shapiro of IBM, 
with Wulf Knausenberger of AT&T Bell 
Labs as the active past chair. Ray Usell 
of Unisys serves as West Coast chair, 
Makoto Watanabe (NTT) and Hisao 
Kanai (NEC) are the Japan coordina¬ 
tors, and Maurice Sage heads the Euro¬ 
pean exploratory venture. Lori 
Capodanno of AT&T serves as the 
treasurer, and Jack Balde of IDC serves 
as the vice chairman, liaison. 

TC membership, which is limited to 
active participants in the meetings, is 
currently about 300. Workshops are by 
invitation or referred invitations only. 

For additional information on the 
Computer Packaging TC, contact Balde 
at Interconnection Decision Consulting, 
Flemington, New Jersey. His phone 
number is (201) 788-5190. 
CONFERENCES


Editor: Edmund L. Gallizzi, Computer Science Dept., Eckerd College, St. Petersburg, FL 33733; (813) 867-1166 x272 


ITC looks at integration of test with design and manufacturing 


Until 1980, US technological leader¬ 
ship had been unquestioned for over 
three decades. Since then, it has become 
increasingly clear that maintaining a 
strong leadership position will require 
significant changes in the structure of 
the US industrial and technological cul¬ 
ture, according to Larry Sumney, presi¬ 
dent of the Semiconductor Research 
Corporation. 

Sumney will be the keynote speaker 
at the 18th International Test Confer¬ 
ence. His address, “The Challenge to 
US Technology Leadership,” will initi¬ 
ate the conference theme, “Integration 
of Test With Design and Manufactur¬ 
ing.” His is the first address in a five- 
part plenary session providing a broad, 
integrating view of the status and future 
of US manufacturers, research consor¬ 
tia, government initiatives, standards 
organizations, and the automated-test- 
equipment industry as related to US 
competitive position in the world of 
high-technology manufacturing. 

The other plenary-session invited 
speakers are Laurence C. Seifert, AT&T
vice president for engineering, manu¬ 
facturing, and production; Helen M. 
Wood, NBS Institute for Computer 
Science and Technology acting deputy 
director and Computer Society vice 
president for standards; Robert E. 
Anderson, GenRad senior vice presi¬ 
dent; and Quinton W. Kelly, DARPA 
special assistant for strategic com¬ 
puting. 

In his plenary talk, “Full-Stream 
Testing of Electronic Circuits and 
Equipment: Status and Opportunities,” 
Seifert will identify operational feed¬ 
back mechanisms that use state-of-the- 
art software interacting with computer- 
integrated manufacturing to collect data 
from processing. The data is driven 
back into design and improved produc¬ 
tion processes, reducing or eliminating 
defects. Seifert will discuss needs and 
capabilities in data collection and analy¬ 
sis, design for test, fault diagnosis, and 
component interrelation. 

Wood’s topic is “Computer and 
Communications Standards: Confor¬ 


mance Test Requirements.” She will 
speak on the need for technically sound 
computer and communication stan¬ 
dards to preserve open competition in 
international markets and to support 
increased productivity and delivery of 
services at reduced costs. To support 
the promised benefits of sound stan¬ 
dards, a consistent set of test methods is 
necessary to validate product confor¬ 
mance. The National Bureau of Stan¬ 
dards plays a key role in supporting the 
development of standards and test 
methods. 

Anderson will present “Reducing 
Time-to-Market: A Major Opportunity 
for the Test Community.” He will dis¬ 
cuss reduced time-to-market for new 
electronic components and equipment 
as a key means of achieving a competi¬ 
tive advantage. He will also address the 
test community’s role in meeting time- 
to-market goals. 

Kelly’s topic is “The Future — 
Challenges and Principles for Design,
Manufacturing, and Test.” He will discuss the increasing dependence of
these fields on the integration of other technologies. New materials,
parallel processing capabilities, and fundamental principles of
integration potentially stand to revolutionize the technical world. Kelly
will give his view of DARPA support of these technologies. Since the
early 1960’s, DARPA’s funding of computer-related research, including
MIT’s Project MAC and ARPAnet, has been important in laying the US
technological leadership base.

A special executive session, led by Stephen Balog, Prudential-Bache
Securities vice president, will provide a business viewpoint of the ATE
industry.

The conference will also include paper presentations, several panel
discussions, and 15 tutorials on test issues.

ITC will be held August 30-September 4 in Washington, DC.

For additional information, contact Doris Thomas at (201) 895-5260.


First annual CompEuro draws 428 attendees

Europe must specialize to be successful, D.J. McLauchlan of ICL stated in
his CompEuro plenary-session address on European cooperation. As a good
example, he cited the European-developed Airbus airplane. An example of
cooperation is the European Computer Research Center, a venture of ICL of
the United Kingdom, Siemens of Germany, and Bull of France. The center is
located in Munich, Germany. Its director is French, and its language is
English.

The European and international nature of CompEuro, held May 11-15 in
Hamburg, West Germany, was evident from the conference program. It
included tutorials, papers, plenary and invited talks, and exhibits from
numerous European countries, Japan, and the US.

The initial conference in an annual series, CompEuro 87 had a successful
first-year attendance of 428, according to James H. Aylor, the Computer
Society’s vice president for conferences and tutorials.

The conference has an initial sponsorship equally shared between the
Computer Society and Region 8 of the IEEE. For this year’s conference,
Region 8 shared its portion of the sponsorship with Verband Deutscher
Elektrotechniker-Frankfurt and Gesellschaft für Informatik.

Next year’s CompEuro is scheduled for April 11-15 in Brussels, and the
following year’s conference is expected to convene in Israel.




Technical program, Jim Blinn tutorial, and Disney panel
scheduled for SIGGraph

“The computer graphics field is maturing to the point that we were able
to accept papers describing how people are integrating algorithms into
graphics systems,” says SIGGraph 87 Technical Program Chair Maureen Stone
of Xerox PARC. Paper submissions show great interest in special-purpose
hardware, algorithms that can be implemented in VLSI, and realistic
animation based on real physical modeling.

In addition to the technical program, the graphics conference will
feature panel discussions, tutorials, exhibits, and film, video, and art
shows.

One of the conference’s 13 panel discussions, “Traditions and the Future
of Character Animation,” will be chaired by John Lasseter, director of
Luxo, Jr., which was nominated for an Academy Award in the short
animation category. The panel will look at the past, present, and future
of Disney animation and high-quality character animation for television.
Panel members include directors from Pinocchio, The Great Mouse
Detective, and Family Dog. Excerpts from their work in animation will be
shown.

“The Mechanical Universe: An Integrated View of a Large-Scale Animation
Project,” presented by James F. Blinn, is one of 30 one-day courses. The
Mechanical Universe, a telecourse developed by Blinn and broadcast by
PBS, includes 550 different scenes and over seven hours of animation
using graphics simulations to demonstrate the laws of physics.

Blinn’s SIGGraph course will describe the techniques of managing projects
as large as the development of The Mechanical Universe, as well as the
provision of video engineering acceptable to television standards,
artistic design, and video hardware for such projects.

Blinn also produced computer graphics effects for another PBS series,
Cosmos, and is the recipient of the 1983 SIGGraph Computer Graphics
Achievement Award. Currently, he produces animation depicting various
space missions for the Jet Propulsion Laboratory.

The conference will be held July 27-31 in Anaheim, California. Additional
information is available by calling SIGGraph 87 Conference Management at
(312) 644-6610.


CALL FOR PAPERS


Call for papers and referees for Computer


Computer seeks articles for inclusion in


Electronic publishing technologies is the 

theme slated for the January 1988 issue. 

Suggested topics include 

• Hypertext systems, 

• computer graphics for publishing, 

• high-resolution printing technology, 

• software systems for document devel¬ 
opment. 

Submit 10 copies of the manuscript by 
August 1, 1987, to Dennis Allison, Com¬ 
puter Systems Laboratory, ERL 444, 
Stanford University, Stanford, CA 
94305; (415)771-8431. 

Neural networks will be covered in the 
March 1988 issue. 

Topics include, but are not limited to 

• neural network architectures; 

• electronic and optical neurocomputers; 

• applications of neural networks in 
vision, speech recognition and synthe¬ 
sis, robotics, image processing, and 
learning; 

• self-adaptive and dynamically
reconfigurable systems;

• neural network models; 

• neural algorithms and models of com¬ 
putation; and 

• programming neural network systems. 

Papers must not have been previously 
presented or published; they also must 
not be currently submitted for journal 
publication. 

Submit a 200-word abstract as soon as 
possible to Bruce Shriver, editor-in-chief, 
Computer, IBM T.J. Watson Research 
Center, PO Box 704, Yorktown Heights, 
NY 10598; (914) 789-7626; arpanet:
shriver@ibm.com; bitnet: shriver at
yktvmh; compmail+: b.shriver.

Eight copies of the complete manuscript 
will be due by August 30, 1987. Authors 
will be notified of acceptance by Novem¬ 
ber 1, 1987. The final version of the man¬ 
uscript will be due no later than 
December 1, 1987. 


Multiple-valued logic will be the theme of 
the April 1988 issue. 

Some suggested topics: 

• device technology, 

• fuzzy logic, 

• optical computing, 

• algebraic aspects, 

• logic design, and 

• applications. 

Submit 10 copies of each manuscript by 
September 15, 1987, to Jon T. Butler. 
Before August 15, 1987, Butler can be 
reached at the Naval Postgraduate 
School, Monterey, CA 93943-5100; (408) 
646-3299. After that date he can be 
reached at Northwestern University,
Dept. of Electrical Engineering and
Computer Science, Evanston, IL 60201;
(312) 491-5628.

Persons interested in serving as referees 
for this issue are asked to contact Butler 
and send him a list of their technical 
interests. 

Referees are also sought for special issues 
that will cover the following topic areas: 

• laser communication technology and 
systems, 

• computer-integrated manufacturing, 

• national computer policies—the com¬ 
puter industry in an international 
context, 

• innovative printer technology, 

• storage technology, 

• concurrent systems: the hardware, 
production, maintenance, and software,

• emerging technologies for computer 
component, hardware, and software 
design and production, 

• high-performance CAD/CAM,
engineering/scientific, programming
workstations,

• computers in automobiles, 

• computers in the entertainment and 
advertising industries, 

• real-time systems, and 

• Videotext: has the vision failed?

Persons interested in serving as referees
are asked to circle Reader Service Num¬ 
ber 195 on the Reader Service Card at the 
back of this magazine and to send the 
card to the Reader Service Inquiries 
Dept. 


Conferences that the Computer Society participates in or sponsors are 
indicated by the Computer Society logo; additional conference sponsors 
are listed in parentheses. Other conferences of interest to our readers are 
also included. 

For inclusion in Call for Papers or Calendar, submit information six weeks
before the month of publication (e.g., for the October 1987 issue, send
information for receipt by August 15, 1987) to Calendar Editor, Computer, 10662 Los
Vaqueros Circle, Los Alamitos, CA 90720. 
Computer Society of the IEEE Technical
Committee on Computer Education:
Contributions up to five typewritten
pages are welcomed for the TCCE newslet¬ 
ter, a forum for the exchange of ideas among 
persons interested in computer education or 
computers in education. Direct news items, 
short articles, and any correspondence to 
Helen Hays, Dept. of Computer Science,
Southeast Missouri State University, Cape
Girardeau, MO 63701; (314) 651-2244.


IEEE Software: Articles on CASE 
tools, workstation software, and scien¬ 
tific systems are sought for the March 1988 
issue. Contact Ted Lewis, editor-in-chief, 
IEEE Software, c/o Computer Science 
Dept., Oregon State University, Corvallis, 
OR 97331; (503) 754-2744; CSnet, 
lewis@oregon-state; Compmail+, t.lewis.
Materials are due by August 1, 1987. 


IEEE Infocom 88: Networks— 
Evolution or Revolution?: March 
28-31, 1988, New Orleans. Submit four 
copies of the complete paper by August 3, 
1987, to Al Leon-Garcia, Dept. of Electrical
Engineering, University of Toronto, 
Toronto, Ontario M5S 1A4, Canada; phone 
(416) 978-5037. 


CompEuro 88: System Design— 
Concepts, Methods, and Tools: April 
11-14, 1988, Brussels. Papers on theoretical 
and practical aspects of system design are 
sought. Submit five copies of the complete 
paper (20 double-spaced pages maximum) by 
August 15, 1987, to Pierre Wodon, Philips 
Research Laboratory Brussels, 2 Ave. Van 
Becelaere, bte 8, B-1170 Brussels, Belgium;
phone 32 (02) 673-41-90. 


Control 88 (IEE): April 13-15, 1988, Oxford, 
England. Submit an extended synopsis (1500 
words maximum) by August 28, 1987, to 
Conference Services, IEE, Savoy Pl.,
London WC2R 0BL, England, UK; phone 44
(01) 240-1871.

International Workshop on Robot Control: 
Theory and Applications (IEE): April 11-12, 
1988, Oxford, England. Submit a synopsis 
(1500 words maximum) by August 28, 1987, 
to the Computing and Control Division,
IEE, Savoy Pl., London WC2R 0BL; phone
44 (01) 240-1871, ext. 330.


ICC-88, International Conference on Com¬ 
munications (IEEE): June 12-15, 1988, 
Philadelphia. Sydney, Australia. Submit 
manuscripts by September 1, 1987, to John 
S. Ryan, AT&T Bell Laboratories, Craw¬ 
fords Corner Rd., Rm. 2M632, Holmdel, NJ 
07733-1988; (201)949-5813. 


Compstan 88, 1988 Computer Standards
Conference: March 21-23, 1988,
Arlington, Virginia. Papers and proposals 
for panel sessions are sought. Submit five 
copies of papers (1000 to 5000 words) and 
five copies of proposals for panel sessions
(include a list of four to five potential
panelists) by September 14, 1987, to James
A. Hall, National Bureau of Standards,
Bldg. 225, Rm. A266, Gaithersburg, MD
20899; (301) 975-3273.

Computer Networking Symposium:
April 11-13, 1988, Arlington, Virginia.
Submit four copies of the complete paper or 
of a 1000-word extended abstract by Septem¬ 
ber 15, 1987, to Jeffrey Jaffe, IBM Research 
Center, PO Box 704, Yorktown Heights, NY 


IEEE International Conference on Robotics 
and Automation: April 25-29, 1988, 
Philadelphia. Papers 15 to 20 pages long are 
sought, as are papers five to seven pages 
long. Submit papers in either category by 
September 15, 1987, to Robert B. Kelley, 
ECSE Dept., Rensselaer Polytechnic Insti¬ 
tute, Troy, NY 12180-3590. 

CHI-88, Conference on Human Factors in 
Computing Systems (ACM): May 15-19,
1988, Washington, DC. Send five copies of
paper (3000 words maximum) by September 
21, 1987, to Sylvia Sheppard, Computer 
Technology Associates, Inc., 14900 Sweitzer 
Lane, Suite 201, Laurel, MD 20707; (301) 
369-2422. Proposals for panel sessions, inter¬ 
active poster sessions, demonstrations of 
experimental or commercial user interfaces, 
video presentations, meetings of special- 
interest groups, and presentations by PhD 
students on their dissertations are also 
sought. 

COIS-88, Conference on Office Information
Systems: March 23-25, 1988,
Palo Alto, California. Send five copies of 
papers by September 21, 1987, to Robert B. 
Allen, 2A-367, Bell Communications 
Research, Morristown, NJ 07960; (201) 
829-4315. 

Fourth IEEE Conference on Artificial
Intelligence Applications: March
14-18, 1988, San Diego, California. Full- 
length papers (5000 words maximum) and 
poster-session papers (1000 words maximum) 
are sought. Submit four copies of the full- 
length paper (include a 100-word abstract) by 
September 25, 1987, to Elaine Kant or 
Dennis O’Neill, Schlumberger-Doll 
Research, Old Quarry Rd., Ridgefield, CT 
06877-4108. Submit four copies of poster- 
session papers to Elaine Kant or Dennis 
O’Neill by December 21, 1987. 

14th International Conference on Electric 
Contacts (IEEE): June 20-24, 1988, Paris. 
Submit an abstract (200 words maximum) by 
October 1, 1987, to S.E.E., 48 Rue de la 
Procession, 75724 Paris Cedex 15, France. 

IEEE Software: Articles on compiler
environments and porting techniques
are sought for the May 1988 issue. Contact 
Ted Lewis, editor-in-chief, IEEE Software, 
c/o Computer Science Dept., Oregon State 
University, Corvallis, OR 97331; (503) 
754-2744; CSnet, lewis@oregon-state; 
Compmail +, t.lewis. Materials are due by 
October 1, 1987. 


IEEE International Symposium on Informa¬ 
tion Theory: June 19-24, 1988, Kobe, Japan. 
Papers suitable for 40-minute and 20-minute 
presentations are sought. To have a paper 
considered in the first category, submit three 
copies of both a complete manuscript and an 
abstract of no more than 180 words; to have 
a paper considered in the second category, 
submit three copies of both a 500-word sum¬ 
mary and an abstract of no more than 180 
words. Materials for papers in the first cate¬ 
gory are due by November 1, 1987, and 
materials for papers in the second category 
are due by December 1, 1987. Submit them 
to either Shu Lin, Dept. of Electrical
Engineering, University of Hawaii at Manoa,
Holmes Hall 483, 2540 Dole St., Honolulu, 
Hawaii 96822 or Suguru Arimoto, Faculty of 
Engineering Science, Osaka University, 
Toyonaka, Osaka 560, Japan. 

IEEE Transactions on Computers:
Papers are sought for a special issue on
architectural support for programming lan¬ 
guages and operating systems. Guidelines for 
submitting manuscripts appear in every issue 
of IEEE Transactions on Computers. Send 
seven copies of the manuscript by November 
1, 1987, to Randy Katz, Dept. of Electrical
Engineering and Computer Science, Com¬ 
puter Science Division, Evans Hall, Univer¬ 
sity of California, Berkeley, CA 94720; (415) 
642-8778. 

IEEE Computer Society Conference on
Computer Vision and Pattern Recognition:
June 5-9, 1988, Ann Arbor, Michigan.
Papers suitable for 30-minute and 20-minute 
presentations are sought. Submit three 
copies of papers in either category by 
November 11, 1987, to Larry S. Davis, Cen¬ 
ter for Automation Research, University of 
Maryland, College Park, MD 20742. 


IEEE Software: Articles on fourth-generation
language development and
how technology affects software practice are 
sought for the July 1988 issue. Contact Ted 
Lewis, editor-in-chief, IEEE Software, c/o 
Computer Science Dept., Oregon State Uni¬ 
versity, Corvallis, OR 97331; (503) 754-2744; 
CSnet, lewis@oregon-state; Compmail +,
t.lewis. Materials are due by December 1, 
1987. 

IEEE Software: Articles on human-computer
interface approaches are
sought for the September 1988 issue. Contact 
Ted Lewis, editor-in-chief, IEEE Software, 
c/o Computer Science Dept., Oregon State 
University, Corvallis, OR 97331; (503) 
754-2744; CSnet, lewis@oregon-state; 
Compmail +, t.lewis. Materials are due by 
February 1, 1988. 

IEEE Software: Articles on software
engineering and expert systems are
sought for the November 1988 issue. Contact 
Ted Lewis, editor-in-chief, IEEE Software, 
c/o Computer Science Dept., Oregon State 
University, Corvallis, OR 97331; (503) 
754-2744; CSnet, lewis@oregon-state; 
Compmail +, t.lewis. Materials are due by 
April 1, 1988.

CALENDAR


July 1987 


ACM SIGGraph 87, July 27-31, Anaheim,
California. Contact SIGGraph
87 Conference Management, Smith Bucklin
and Associates, Inc., 111 E. Wacker Dr.,
Suite 600, Chicago, IL 60601; (312) 
644-6610. 


14th Annual Conference on Computer 
Graphics and Interactive Techniques (ACM), 
July 27-31, Anaheim, California. Contact 
SIGGraph 87 Conference Management, 
Smith Bucklin and Associates, Inc., Suite 
600, Chicago, IL 60601; (312) 644-6610. 


August 1987 


Seventh International Conference on Com¬ 
puter Science (IEEE), August 3-7, Santiago, 
Chile. Contact Hector Garcia-Molina, Dept.
of Computer Science, Princeton University, 
Princeton, NJ 08544. 


IEEE/ACM Symposium on the Simulation
of Computer Networks (Computer
Society, ACM, SCS), August 5-7,
Colorado Springs, Colorado. Contact Mitchell
Spiegel, GTE Systems, 1700 Research
Blvd., Rockville, MD 20850; (301) 294-8400. 


Sixth ACM SIGACT-SIGOps Symposium 
on Distributed Computing, August 10-12,
Vancouver, British Columbia, Canada. Contact
David Kirkpatrick, Dept. of Computing
Science, University of British Columbia, 
Vancouver, British Columbia, Canada, or 
Fred B. Schneider, Dept. of Computer
Science, Upson Hall, Cornell University, 
Ithaca, NY 14853. 

ACM SIGComm 87 Workshop: Frontiers in 
Communications Technology, August 11-13,
Stowe, Vermont. Contact John Trono,
Computer Science Dept., St. Michael’s Col¬ 
lege, PO Box 243, Winooski, VT 05404. 

Ninth International Conference on
Production Research (Computer Society,
IFPR), August 17-20, Cincinnati, Ohio.
Contact Ernest L. Hall, Center for Robotics 
Research, University of Cincinnati, ML 72, 
Cincinnati, OH 45221. 

1987 International Conference on Parallel 
Processing, August 17-21, St. Charles, 
Illinois. Contact T. Feng, Pennsylvania State 
University, Dept. of Electrical Engineering,
University Park, PA 16802; (814) 863-1469. 


International Workshop on Petri Nets
and Performance Models (Computer
Society, ACM), August 24-26, Madison,
Wisconsin. Contact Tadao Murata, University
of Illinois at Chicago, Dept. of Electrical
Engineering and Computer Science, MC 154, 
Box 4348, Chicago, IL 60680; (312) 996-2307 
or (312) 996-3422. 


1987 IEEE Workshop on Languages
for Automation, August 24-27,
Vienna. Contact Shi-Kuo Chang, Dept. of
Computer Science, University of Pittsburgh, 
Pittsburgh, PA 15260; (412) 624-8441. 


Eurographics 87 (ACM, IFIP), August 
24-28, Amsterdam. Contact Secretariat 
Eurographics 87, c/o Organisatie Bureau 
Amsterdam, Europaplein 12, 1078 GZ 
Amsterdam, The Netherlands; phone 31 (20) 
44-08-07. 


Tencon, August 26-28, Seoul, South Korea. 
Contact Jung Uck Seo, Korea Telecom 
Authority, 100 Sejong-Re, Chongro-ku, 
Seoul 110, South Korea. 


Symposium on Logic Programming, 
August 31-September 4, San Francisco.
Contact David S. Warren, Quintus Com¬ 
puter Systems, Inc., 1310 Villa St., Moun¬ 
tain View, CA 94041. 


September 1987 


1987 Annual International Test Conference,
September 1-3, Washington,
DC. Contact Doris Thomas, PO Box 264, 
Mount Freedom, NJ 07970; (201) 895-5260. 


Second IFIP Conference on Human Interac¬ 
tion, September 1-4 (ACM), Stuttgart, West 
Germany. Contact H.J. Bullinger, FhG 
IAO, Holzgartenstr. 17, 7000 Stuttgart 1,
West Germany. 


13th International Conference on Very
Large Databases (Computer Society,
ACM, BCS), September 1-4, Brighton, 
England. Contact Stuart E. Madnick, Mas¬ 
sachusetts Institute of Technology, Sloan 
School of Management, 50 Memorial Dr., 
E53-317, Cambridge, MA 02139; (617) 
253-6671. 


Fifth Workshop on Multidimensional Signal 
Processing (IEEE, EURASIP), September 
14-16, Noordwijkerhout, The Netherlands. 
Contact Mrs. Y. Smits, Dept. of Electrical
Engineering, Delft University of Technol¬ 
ogy, PO Box 5031, 2600 GA Delft, The 
Netherlands. 


Third International Conference on Func¬ 
tional Programming Languages and Com¬ 
puter Architecture (ACM, IFIP, INRIA), 
September 14-16, Portland, Oregon. Contact 
John H. Williams, IBM Almaden Research 
Center, K53/803, 650 Harry Rd., San Jose,
CA 95120-6099 or Richard B. Kieburtz,
Computer Science Dept., Oregon Graduate 
Center, 19600 N.W. Von Neumann Dr., 
Beaverton, OR 97006-1999; (503) 690-1152. 


ICDCS-7, Seventh International Conference
on Distributed Computing Systems,
September 14-18, Berlin. Contact R.
Popescu-Zeletin, Hahn-Meitner-Institut Ber¬ 
lin, Glienicker Strasse 100, D-1000, Berlin 
39, West Germany; phone 49 (30) 8009-2594 
or 49 (30) 8009-2541. 


ICCC-ISDN 87, Evolving to Integrated Serv¬ 
ices Digital Networks in North America, 
September 15-17, Dallas. Contact Audrey 
Atterbury, Bell Atlantic, 1310 N. Court 
House Rd., tenth floor, Arlington, VA 
22201; (703) 974-5454.


Midcon 87 (IEEE), September 15-17,
Rosemont, Illinois. Contact Alexes
Razevich, Electronic Conventions Manage¬ 
ment, 8110 Airport Blvd., Los Angeles, CA 
90045; (213) 772-2965 or (800) 421-6816. 


26th Lake Arrowhead Workshop:
Specifying Concurrent Systems in the
Year 2000, September 16-18, Lake Arrow¬ 
head, California. Contact Brent Hailpern, 
IBM T.J. Watson Research Center, PO Box 
704, Yorktown Heights, NY 10598; (914) 
789-7797; CSnet bth@ibm.com. 


IEEE Holm Conference on Electrical Con¬ 
tacts, September 21-23, Chicago. Contact 
IEEE Holm Conference Registrar, 345 E. 
47th St., New York, NY 10017; (212)
705-7405. 

CSM-87, Conference on Software
Maintenance (Computer Society,
AWC, DPMA, NBS, SMA), September 
21-24, Austin, Texas. Contact Roger Martin, 
National Bureau of Standards, Bldg. 225, 
Rm. B266, Gaithersburg, MD 20899; (301) 
921-3545 or Computer Society of the IEEE, 
1730 Massachusetts Ave. NW, Washington, 
DC 20036-1903; (202) 371-0101. 


IEEE Videoconference on High-Performance 
Integrated Circuit Packaging, September 22,
various sites. Contact IEEE Continuing Education
Dept., IEEE Service Center, 445 Hoes
Lane, PO Box 1331, Piscataway, NJ 
08855-1331; (201) 981-0060, ext. 412. 

Northcon (IEEE), September 22-24, Port¬ 
land, Oregon. Contact Dale Litherland, 
Electronic Conventions Management, 8110 
Airport Blvd., Los Angeles, CA 90045; (213) 
772-2965. 


International Conference on Software
Engineering for Real-Time Systems,
September 28-30, Cirencester, England. 
Contact R. Larry, Institute of Electronic and 
Radio Engineers, 99 Gower St., London 
WC1E 6AZ, England, UK.


Oceans 87, September 28-October 1,
Halifax, Nova Scotia, Canada. Contact
Alan R. Longhurst, Bedford Institute of
Oceanography, PO Box 1006, Dartmouth, 
Nova Scotia B2Y 4A2, Canada. 


October 1987 

1987 Workshop on Computer Architecture
for Pattern Analysis and
Machine Intelligence, October 5-7, Seattle.
Contact Steve Tanimoto, Dept. of Computer
Science, FR-35, University of Washington, 
Seattle, WA 98195; (206) 543-1695. 

12th Conference on Local Computer
Networks, October 5-7, Minneapolis,
Minnesota. Contact Stephanie Johnson, 
Start, Inc., 10301 Toledo Ave. South, 
Bloomington, MN 55437; (612) 831-2122. 

ASPLOS-II, Second International
Conference on Architectural Support 
for Programming Languages and Operating 
Systems (Computer Society, ACM), October 
5-8, Palo Alto, California. Contact Randy 
H. Katz, Computer Sciences Division, UC 
Berkeley, Evans Hall, Berkeley, CA 94720 or 
Martin Freeman, Signetics Corp., 811 E. 
Arques Ave., Sunnyvale, CA 94086; (408) 
991-3591. 

ICCD-87, IEEE International Conference
on Computer Design: VLSI in
Computers and Processors, October 5-8, Rye
Brook, New York. Contact Prathima
Agrawal, AT&T Bell Laboratories, 600 
Mountain Ave., Rm. 3D-480, Murray Hill, 
NJ 07974; (201)582-6943. 

IWDM-87, Fifth International Workshop on 
Database Machines (ACM), October 5-8,
Karuizawa, Nagano Prefecture, Japan. Contact
M. Kitsuregawa, Institute of Industrial
Science, University of Tokyo, 7-22-1, Roppongi,
Minato-ku, Tokyo 106, Japan; phone
81 (03)402-6231. 

OOPSLA-87, Object-Oriented Programming 
Systems, Languages, and Applications 
(ACM), October 5-8, Kissimmee, Florida. 
Contact Chet Winiski, Productivity Products 
International, 27 Glen Rd., Sandy Hook, CT 
06482. 

Compsac 87 (Computer Society, IPSJ),
October 5-9, Tokyo. Contact Tosiyasu 
L. Kunii, c/o Business Center for Academic 
Societies Japan, Yamazaki Bldg. 4F,
2-40-14, Hongo, Bunkyo-ku, Tokyo 113,
Japan; phone 81 (3) 817-5831, or Albert K. 
Hawkes, Sargent & Lundy, Engineering 
Consultants, 55 E. Monroe, Chicago, IL 
60603; (312) 269-3640, or Stephen S. Yau, 
Northwestern University, Dept. of Electrical
Engineering and Computer Science, Evan¬ 
ston, IL 60201; (312) 491-3641. 

International Workshop on AI Applications 
to CAD Systems for Electronics (IEEE), 
October 8-10, Munich, West Germany. Contact
Werner Sammer, Siemens AG, Corporate
Research and Technology,
Otto-Hahn-Ring 6, 8000 Munich 83, West 
Germany; phone 49 (89) 636-3348 or Alfred 
C. Weaver, Lockheed EMSCO, MS B-08, 
2400 NASA Rd. One, Houston, TX 77058; 
(713)333-6792. 


Seventh Annual Symposium on Small
Computers in the Arts, October 8-11,
Philadelphia. Contact Richard Moberg, 338
S. Quince St., Philadelphia, PA 19107; (215) 
834-1511. 

Second Workshop on Large-Grained
Parallelism, October 11-14, Somerset,
Pennsylvania. Contact Maurice Herlihy,
Dept. of Computer Science, Carnegie Mellon
University, Pittsburgh, PA 15213; (412) 
268-2584. 

FOCS-87, October 12-14, Los Angeles. 
Contact Ashok Chandra, IBM T.J. 
Watson Research Center, PO Box 218, 
Yorktown Heights, NY 10598; (914) 
945-1752. 

International Symposium on Methodologies 
for Intelligent Systems (ACM), October 
14-17, Charlotte, North Carolina. Contact 
Zbigniew W. Ras, Dept. of Computer
Science, University of North Carolina, Char¬ 
lotte, NC 28223; (704) 547-4567. 

Third Annual Expert Systems in
Government Conference (Computer
Society, AIAA), October 19-23, Washing¬ 
ton, DC. Contact Peter Bonasso, Mitre 
Washington AI Center, 7725 Colshire Blvd., 
MS W952, McLean, VA 22102; (703) 


Second International Conference on Data 
and Knowledge for Manufacturing and 
Knowledge (ACM), October 19-24, Hart¬ 
ford, Connecticut. Contact Fred Maryanski, 
University of Connecticut, Computer 
Science and Engineering Dept., U-155, 
Storrs, CT 06268. 

AIPR-88, Applied Imagery Pattern
Recognition, October 23-30, Washington,
DC. Contact Jane Harmon, 403 Argus
Pl., Sterling, VA 22107; (703) 351-2708.

FJCC-87, Fall Joint Computer Conference 
(Computer Society, ACM), October 25-29,
Dallas. Contact Debra Anthony, Texas
Instruments, 6500 Chase Oaks Blvd., PO 
Box 86905, MS 8419, Plano, TX 75086;
(214) 575-2151.


November 1987 

21st Annual Asilomar Conference on Sig¬ 
nals, Systems, and Computers (IEEE, Naval 
Postgraduate School), November 2-4,
Pacific Grove, California. Contact Douglas
F. Elliott, Rockwell International Corp., 
3370 Miraloma Ave., MS BD07, Anaheim, 
CA 92803-3170. 


Workshop on Workstation Operating
Systems, November 5-6, Cambridge, 
Massachusetts. Contact Luis-Felipe Cabrera, 
6572 Northridge Dr., San Jose, CA 95120; 
(408)927-1838. 

11th ACM Symposium on Operating Sys¬ 
tems Principles, November 9-11, Austin, 
Texas. Contact Alfred Spector, Dept. of
Computer Science, Carnegie Mellon Univer¬ 
sity, Pittsburgh, PA 15213 or Les Belady, 
MCC, 9430 Research Blvd., Echelon Bldg. 
No. 1, Suite 200, Austin, TX 78759; (512) 
834-3330. 

ICCAD-87, IEEE International Conference
on Computer-Aided Design,
November 9-12, Santa Clara, California. 
Contact Basant Chawla, AT&T Bell Labora¬ 
tories, 1247 S. Cedar Crest Blvd., Allen¬ 
town, PA 18103; (215) 770-3484. 

International Conference on Information
Science and Engineering (Computer
Society, IERE), November 25-27,
York, England. Contact R. Larry, Institute
of Electronic and Radio Engineers, 99 
Gower St., London WC1E 6AZ, England, 
UK; 44 (01) 388-3071. 

Workshop on Computer Vision,
November 30-December 2, Miami 
Beach, Florida. Contact Harry Hayman, 738 
Whittaker Terrace, Silver Spring, MD 20901; 
(301) 434-1990. 


December 1987 

Eighth Real-Time Systems Symposium,
December 1-3, San Francisco. Contact
Kang G. Shin, Dept. of Electrical Engineering
and Computer Science, University of
Michigan, Ann Arbor, MI 48109-1109; (313) 
763-0391. 

Eighth Annual International Conference on 
Information Systems (ACM, SIM), December
6-9, Pittsburgh. Contact William R.
King, Graduate School of Business, University
of Pittsburgh, Pittsburgh, PA 15260;
(412)648-1587. 


January 1988 


21st Hawaii International Conference
on System Sciences, January 5-8,
Kailua-Kona, Hawaii. Contact Ralph H.
Sprague, Jr., Decision Sciences Dept., Uni¬ 
versity of Hawaii, 2404 Maile Way, E-303, 
Honolulu, HI 96822; (808) 948-7430. 

Annual IEEE Design Automation
Workshop, January 13-15, Apache 
Junction, Arizona. Contact Walling Cyre, 
Control Data, HQM 173, Box 1249, Minneapolis,
MN 55440; (612) 853-2692.

FOURTH IEEE SYMPOSIUM ON

LOGIC PROGRAMMING 

August 31-September 4, 1987 
Hyatt on Union Square 
San Francisco 


Conference Chairperson: David S. Warren, Quintus Computer Systems 
Program Chairperson: Seif Haridi, Swedish Institute of Computer Science 

Logic programming is a new and growing field. The Fourth IEEE Symposium on Logic Programming will provide a forum for 
researchers active in the area to meet and exchange ideas. It also provides novices to the field an opportunity to become 
familiar with the important issues in the area of logic programming. The program includes 49 research papers presented by 
leading researchers from around the world and covering a wide range of topics in the area. In addition the 5 invited speakers are 
some of the most respected workers in the logic programming area. A special aspect of this year’s Symposium is the range of 
tutorials that will be offered on Monday August 31. There will be 8 half-day tutorials that cover both introductory and advanced 
topics. The number and range of advanced tutorials this year is especially exciting. 


Sessions include: 

• Language Issues 

• Parallelism in Prolog 

Architecture, Parallel Models 
Parallel Machines 

• Language Theory and Semantics 


• Program Analysis and Methodology 
• Logic and Databases

• Program Development and Methodology 

• Implementations 

• Applications 


Morning Tutorials: 

1: An Introduction to AI Programming Using Prolog 

George Luger, Department of Computer Science, University of New 
Mexico 

2: Parallel Prolog Tutorial 

Seif Haridi, Logic Programming Systems, Swedish Institute of 
Computer Science 

3: Constraint Logic Programming 

J. Jaffar, J-L Lassez, C. Lassez, IBM T. J. Watson Research Center

4: Practical Prolog for Real Programmers 

Richard A. O’Keefe, Quintus Computer Systems, Inc. 


Invited Speakers: 

• Cuthbert Hurd, Quintus Computer Systems

• David H. D. Warren, Manchester University

• Ehud Shapiro, Weizmann Institute

• Takashi Chikayama, ICOT

• William Kornfeld, Quintus Computer Systems 


Afternoon Tutorials: 

5: Advanced AI Prolog Programming Techniques 

Leon Sterling, Computer Engineering & Science, Case Western
Reserve University 

6: The Warren Abstract Machine 

David S. Warren, Quintus Computer Systems, Inc. 

7: Natural Language and Logic Programming (Logic Grammars) 

Lynette Hirschman, Technical Director for AI, Unisys Defense Systems 

8: Parallel Logic Programming Languages 

Ian Foster & Steve Gregory, Imperial College 


Symposium Registration 


Advance symposium and tutorial registration is available until August 
1, 1987. No refunds will be made after that date. To register and for 
a complete advance program, contact: 

Fourth IEEE Symposium on Logic Programming 

Computer Society of the IEEE 

c/o Glenda McBride 

1730 Massachusetts Avenue, N.W. 

Washington, D.C. 20036-1903 
(202)371-1013 


Symposium Registration:            Advance   On-Site

IEEE-Computer Society members      $190      $230
Non-members                        $240      $290
Full-time student members          $ 50      $ 50
Full-time student non-members      $ 65      $ 65
Retired members                    $ 50      $ 50


Hotel Reservation Coupon 

Mail or Call: 

Hyatt on Union Square 

345 Stockton, San Francisco, CA 94108 
Phone: 1-800-228-9000 or 415-398-1234 
Telex: 340592 


Affil 


Full Mailing. 


Telephone:- 

Date of Arrival:_Date of Departure: 


Tutorial Registration: Circle 1 or 2 (one from each column)

AM:
1: Luger
2: Haridi
3: Jaffar, Lassez, Lassez
4: O’Keefe

PM:
5: Sterling
6: Warren
7: Hirschman
8: Foster, Gregory


                      Advance             On-Site

IEEE-C.S. members     $85 (2 for $140)    $100 (2 for $170)
Non-members           $105 (2 for $175)   $130 (2 for $215)
Full-time students    $35 (2 for $65)     $35 (2 for $65)


Total enclosed:- 

A deposit of one night’s room or credit card guarantee is required for 
arrivals after 4 p.m. 

Room rates (circle your choice): 

Single Double 

$ 94 $109 

$109 $124 

Other rooms and suites available. 

Reservations must be made mentioning SLP’87 by August 1, 1987 to 
guarantee these special rates. 



THE COMPUTER SOCIETY

CAREER OPPORTUNITIES


RATES: $12.00 per line, $120 minimum charge 
(up to ten lines). Average six typeset words per 
line, nine lines per column inch. Add $10 for box
number. Send copy at least six weeks prior to 
month of publication to: Heidi Rex or Marian 
Tibayan, Classified Advertising, IEEE COM¬ 
PUTER Magazine, 10662 Los Vaqueros Circle, 
Los Alamitos, CA 90720. 

In order to conform to the Age Discrimination in 
Employment Act and to discourage age discrim¬ 
ination, COMPUTER may reject any advertise¬ 
ment containing any of these phrases or similar 
ones: “...recent college grads...,” “...1-4
years maximum experience...,” “...up to 5
years experience...,” or “...10 years maximum
experience.” COMPUTER reserves the
right to append to any advertisement, without 
specific notice to the advertiser, “Experience 
ranges are suggested minimum requirements, 
not maximums.” COMPUTER assumes that, 
since advertisers have been notified of this 
policy in advance, they agree that any experi¬ 
ence requirements, whether stated as ranges or 
otherwise, will be construed by the reader as 
minimum requirements only. 


UNIVERSITY OF ILLINOIS 
AT URBANA-CHAMPAIGN 
Supercomputer Center Staff 

The University has established the Center for 
Supercomputing Research and Development 
(CSRD) with support from the U.S. Government, 
the State of Illinois, and industrial affiliates. The 
CSRD is seeking additional qualified technical 
staff with expertise in high-speed multiproces¬ 
sor systems. 

The CSRD has initiated an effort to construct the 
Cedar multiprocessor that will deliver high per¬ 
formance over a wide range of computations. 
This effort is based on over 15 years of research 
in software, architecture, and algorithm develop¬ 
ment performed by a nucleus of experienced 
research staff. We are seeking additional appli¬ 
cants with degrees in Computer Science, Com¬ 
puter Engineering or Electrical Engineering for 
the following full-time regular positions:

SENIOR SOFTWARE ENGINEERS. To provide
programming expertise and technical leadership 
in compiler development for parallel and array 
computers, operating systems implementation, 
and instrumentation of parallel systems for per¬ 
formance evaluation. Ph.D. preferred. M.S. degree 
with 4 years experience required. 

SOFTWARE ENGINEERS. To take part in research 
and development of the above software activities. 
M.S. or equivalent experience preferred. B.S. 
degree with 2 years experience required. 

SENIOR COMPUTER SCIENTISTS. To provide 
programming expertise and technical leadership 
in the development of multiprocessor numerical 
and nonnumerical algorithms in a variety of ap¬ 
plications, including graphics. Ph.D. preferred. 
M.S. degree with 2 years experience required.

COMPUTER SCIENTISTS. To take part in research
and development of the above numerical
and nonnumerical algorithms. M.S. or equivalent 
experience preferred. B.S. degree with 2 years 
experience required. 


SENIOR COMPUTER SYSTEMS ENGINEERS. 
To provide expertise and technical leadership in 
development of state-of-the-art multiprocessor 
architecture and hardware. Ph.D. preferred. M.S. 
degree with 4 years experience required.

COMPUTER SYSTEMS ENGINEERS. To take
part in research and development of state-of-the- 
art multiprocessor architecture and hardware. 
M.S. or equivalent experience preferred. B.S. 
degree with 2 years experience required. 

In rare instances, equivalent qualifications will 
be acceptable in lieu of those stated above. 

We are also seeking applicants with degrees in 
Computer Science, Computer Engineering or 
Electrical Engineering for the following full-time 
and/or part-time temporary positions: 

VISITING SENIOR SOFTWARE ENGINEERS. To 
provide programming expertise and technical 
leadership in compiler development for parallel 
and array computers, operating systems im¬ 
plementation, and instrumentation of parallel 
systems for performance evaluation. Ph.D. 
preferred. M.S. degree with 4 years experience 
required. 

VISITING SOFTWARE ENGINEERS. To take part 
in research and development of the above soft¬ 
ware activities. M.S. or equivalent experience 
preferred. B.S. degree with 2 years experience 
required. 

VISITING SENIOR COMPUTER SCIENTISTS. To 
provide programming expertise and technical 
leadership in the development of multiprocessor 
numerical and nonnumerical algorithms in a 
variety of applications, including graphics. Ph.D. 
preferred. M.S. degree with 2 years experience 
required. 

VISITING COMPUTER SCIENTISTS. To take part 
in research and development of the above 
numerical and nonnumerical algorithms. M.S. or 
equivalent experience preferred. B.S. degree 
with 2 years experience required. 

VISITING SENIOR COMPUTER SYSTEMS ENGI¬ 
NEERS. To provide expertise and technical 
leadership in development of state-of-the-art 
multiprocessor architecture and hardware. Ph.D. 
preferred. M.S. degree with 4 years experience 
required. 

VISITING COMPUTER SYSTEMS ENGINEERS. 
To take part in research and development of state- 
of-the-art multiprocessor architecture and hard¬ 
ware. M.S. or equivalent experience preferred. B.S. 
degree with 2 years experience required. 

In rare instances, equivalent qualifications will 
be acceptable in lieu of those stated above.

Please send resume including educational background,
recent professional experience, and
salary requirements to: 

Judy Maier 

Center for Supercomputing Research 
and Development 

104 S. Wright St., 305 Talbot Lab 

Urbana, Illinois 61801 217/244-0061

All available positions provide salaries commensurate
with experience.

Please specify the position or positions for which
you are applying. If not, your application may be 
delayed. In order to insure full consideration, all 
applications should be submitted by August 15, 
1987. Interviews may take place before the closing 
date but no final decisions will be made until after 
the closing date. The starting date is flexible. 

The University of Illinois is an Affirmative Ac¬ 
tion/Equal Opportunity Employer. 


TEMPLE UNIVERSITY 
Faculty Positions 
Department of Computer and 
Information Sciences 

The Department of Computer and Information 
Sciences has tenure-track and adjunct faculty 
positions available in the area of Information 
Systems. 

Applicants should hold the Ph.D. degree in Information
Systems, Computer Science, or a closely
related field. The ability to contribute to strong 
instructional and research programs, both grad¬ 
uate and undergraduate, will be the primary req¬ 
uisite for appointment. Salary and rank will be 
determined by the appointee’s experience. An 
applicant for a senior position should have a 
strong record of scholarly achievement, while an 
applicant for an assistant professorship should 
present evidence of research potential. 

The Department of Computer and Information 
Sciences offers programs leading to the 
Bachelors, Masters and Ph.D. degrees in Business
Administration/Information Science as
well as in Computer Science. Currently the 
Department has a full-time faculty of 27, 300 
graduate students enrolled in the masters and 
Ph.D. programs, and 1200 undergraduate majors.
A networked departmental computing facility in¬ 
cludes a VAX-11/780, VAX-11/750, and 20 
microVAX systems configured as time-shared 
systems and workstations. The facility also includes
an AT&T 3B2, TI Explorer, and an image
processing laboratory. Departmental instruc¬ 
tional laboratories use VAX (VMS or UNIX) and 
PDP-11 systems, as well as microcomputers. 
University facilities currently in use by the 
department include a CDC Cyber 750, an IBM 
3081, an IBM 4381 and several microcomputer 
laboratories. A multi-processor, high speed com¬ 
puting facility is also available for use. 

Temple University is an Equal Opportunity/Affir¬ 
mative Action Employer and specifically invites 
and encourages applications from women and 
minorities. To apply, submit vitae and bibliography
to:

Eugene Kwatny, 

Chairman of Faculty Search Committee, 
Department of Computer and Information 
Sciences, 

Temple University, 

Computer Activities Building, 38-24, 
Philadelphia, PA 19122. 


KNOX COLLEGE 

Mathematics and Computer Science 
Galesburg, Illinois 61401 

Two positions, one tenure track and one visiting, 
for computer scientist/mathematician to teach 
computer science and mathematics in a depart¬ 
ment of mathematics and computer science. 
Teaching load is two courses per term for each 
of three terms. Salary and rank dependent on 
qualifications and experience. Knox is a selec¬ 
tive liberal arts college. Applications from 
women and minorities are strongly encouraged. 
Send vita, transcripts, and three letters of recommendation
to Dennis M. Schneider, Chairman.

SOFTWARE ENGINEER


THE UNIVERSITY OF FLORIDA 
Associate Vice President for 
Telecommunications 

The University of Florida is seeking applicants and
nominations for a newly created position in its
central administration, Associate Vice President
for Academic Affairs for Telecommunications.
The person in this position will be responsible
for providing the leadership necessary to bring
the University to the forefront in telecommunication
and local networking services in support of
research and education programs. Major responsibilities
include: developing and implementing
a long-range plan for telecommunications and
networking; managing a major study to develop
and implement an integrated metropolitan area
network; and supervising and coordinating the
activities of two major computer centers which
provide academic and administrative computing
services to the University and region. Ancillary
duties include the direction and supervision of
telecommunications and network personnel,
serving as the Information Resource Manager for
the University and as the Institutional Representative
to BITNET.

Qualifications for this position include: a minimum
of three years successful experience in management
of telecommunications and data networking
in a higher education environment for
research and education programs; a Ph.D. or
equivalent experience (a master's degree in an
appropriate discipline is a minimal requirement); a
vision for future telecommunications development
in higher education; an understanding of
telecommunications and networking in heterogeneous
environments; a sensitivity to the needs of
end users; superior leadership and organizational
skills; and excellent communication skills, both
written and oral.

This position will be available as of September 1,
1987, but the starting date is negotiable. The
salary is negotiable based upon degree and
experience.

Nominations, applications, and inquiries must be 
submitted on or before July 31, 1987, with 
resumes and the names of three references to: 
Dr. Jeaninne Webb, Chairman 
Search and Screen Committee 
1012J Turlington Hall 
University of Florida 
Gainesville, Florida 32611 
The search will be conducted under Florida's 
“Government in the Sunshine” law. The University 
of Florida is an Equal Employment Opportunity 
Affirmative Action Employer. 


NAVAL POSTGRADUATE SCHOOL, 
Monterey, CA 

Position Announcement in Computer Science 

The Department of Computer Science has
immediate openings for faculty positions at all
levels in the areas of Artificial Intelligence and
Computer Architecture. An applicant should
have a PhD in Computer Science or a related
field and have a strong interest in both graduate
teaching and research. Senior applicants must
have distinguished research records.
Appointments can begin at any time during the year.
The Department offers MS and PhD degrees in
Computer Science supported by well-equipped
instructional/research facilities and a full-time
technical staff. The faculty normally teach for
two quarters and conduct full-time research
supported by major research programs during the
other two quarters.

Please send a detailed resume and three letters
of reference to: Professor Vincent Lum, Chair,
Computer Science Department (Code 52), Naval
Postgraduate School, Monterey, CA 93943,
Telephone (408) 646-2449. NPS IS AN EQUAL
OPPORTUNITY/AFFIRMATIVE ACTION EMPLOYER.


NORWICH UNIVERSITY 

Faculty position in Computer Science Engineering.
Tenure-track position, rank and salary
commensurate with experience. Ph.D. desired, but
will consider Masters with experience. Available
1 July, 1987. Recently implemented Computer
Science Engineering program. Facilities
include: VAX Cluster, MicroVAX supermicrocomputers
and PC-based workstations. New faculty
will assist in laboratory and curriculum
development. Expertise in any of the following areas
desired: Digital Systems Design, CAD/CAE,
Computer Architecture, Computer Communications
and Networking. A strong commitment to
undergraduate education is essential. Norwich
University is located in an area of Central
Vermont that offers small town or rural living with
good schools and outstanding recreational
opportunities. U.S. Citizen or Permanent Resident
preferred. Send resume and references to Dr.
Michael C. Murphy, Head, Computer Science
Engineering Department, Norwich University,
Northfield, VT 05663. Norwich University is an
equal opportunity employer.


SOFTWARE ENGINEER 

Design, develop and analyze computer software
utilizing "C" Compiler programs and UNIX
operating systems. Interface with users to
define requirements and develop or enhance
programs and systems. Process complex
algorithms to execute programs. Requires B.S.
Computer Science, 3 years experience, UNIX,
VM/CMS, DOS, COBOL, FORTRAN, PL/I and C
Compiler. $39,500 per year. Job site and
interview Santa Monica, CA. Send this ad and your
resume to Job #SK 4303, P.O. Box 9570,
Sacramento, CA 95823-0570 not later than July
25, 1987.


THE FLINDERS UNIVERSITY 
OF SOUTH AUSTRALIA 
SCHOOL OF MATHEMATICAL SCIENCES 
Lecturer/Senior Lecturer 
In Computer Science 

Applications are invited for the position of
Lecturer/Senior Lecturer in the Discipline of
Computer Science. Applicants should hold a
postgraduate degree, preferably a Ph.D. in Computer
Science. Duties will include undergraduate and
honours level teaching, and the supervision of
research students. Appointment at the Senior
Lecturer level would require a significant record
of achievement in research and teaching. The
position will be available from 1 December, 1987.
The Discipline offers a Computer Science major,
honours programme and postgraduate
opportunities. Machines used for teaching include
four Prime 2250's, Pyramid 90X and Sun 3/280.
Further information may be obtained from Dr.
M.J. Brooks, Discipline of Computer Science, or
by telephone on (08) 275 2662, or by electronic
mail on job@cs.flinders.oz.au.

Flinders University is situated in the Southern
suburbs of Adelaide in the foothills of the Mount
Lofty Ranges and overlooks the City and St.
Vincent’s Gulf. The University has earned a
reputation for its excellent record in research.

Salary Scale:
Lecturer: $A27,859-$A36,600 (an appointment
will not be made above the sixth level of the
scale, viz. $A34,102)
Sr. Lecturer: $A37,381-$A43,568

Applications, including detailed curriculum vitae
and the names and addresses of three referees,
should be lodged in duplicate, with the
Registrar, The Flinders University of South
Australia, Bedford Park, S.A. 5042, by 7th August,
1987.


Design and implement software to handle the
interface between an IBM PC and a real-time alarm
system. Duties include determining protocol for
PC and alarm system, data acquisition software
and data conversion software. Develop and
implement data base management package for
alarm system to integrate data analysis
software, window display and manipulation software
and memory management. Research or project
background in IBM PC internal MS-DOS operating
system and interrupt service routines,
interactive software development, system level
data structure, and data base management.
Master’s or equiv. in Computer Science.
$2,666/mo. 40 hrs./week. No exp. req’d. Job
Site/Interview: San Francisco, CA. Clip this ad
and send with resume to MLU #3092, P.O. Box
9560, Sacramento, CA 95823-0560 not later than
July 30, 1987.


SOFTWARE ENGINEER 

Responsible for porting implementation of the
Warren Abstract Machine, and other portions of
the Prolog system, to various Unix systems,
VAX/VMS, and IBM mainframes. This work
involves: 1) porting the Prolog system to various
hardware architectures through the maintenance
and adaptation of software in Prolog, C,
MockLisp, Progol (a Quintus proprietary
language), and assembly language; and 2) developing
interfaces from Prolog to FORTRAN, Pascal,
and Lisp. Requires a B.A. degree in Computer
Science. Also requires research or project
experience in Prolog implementations; excellent
knowledge of Unix and VMS internals, as well as
proficiency in Prolog, C, FORTRAN, Lisp, Pascal,
and assembly language programming; and a
strong background in logic programming and
artificial intelligence. Salary: $29,500/yr. 40 hrs. per
wk. Place of employment: Mt. View, CA. Send
this ad and a resume to Job #MLU #2726, P.O.
Box 9560, Sacramento, CA 95823-0560 not later
than August 1, 1987.


SYSTEMS ANALYST 

To direct the design, development and
implementation of graphics and image processing
software. Will also design and write device drivers
and systems to port software among various
machines/operating systems. We require a B.S.
in Computer Science and 2 years experience as
systems analyst or academic equivalent of 2
years experience plus ability in FORTRAN, C and
assembly languages, VMS, UNIX and PRIMOS
operating systems. Sal: $34,056/yr. Interview/job
site: Culver City. Send resume or letter
explaining qualifications to: Job #WS4090, P.O. Box
9560, Sacramento, CA 95823-0560 not later than
Aug. 1. Must show authorization to accept
permanent employment in the U.S.A.


UNIVERSITY OF THE DISTRICT OF COLUMBIA 
Computer Science 

The University invites applications for faculty
positions in Computer Science. Ph.D. required
for associate and full professorships. Master’s
degree plus experience may be considered.
Areas of specialization needed include:
Compiler Design and Theory, Digital Design,
Computer Architecture, Microprocessor Applications,
and Data Structures. Salary and rank
depend on qualifications. Submit complete
resume to Ms. Christine Poole, Office of
Personnel, University of the District of Columbia, 4200
Connecticut Avenue, N.W., Washington, D.C.
20008. An Equal Opportunity Employer.
NEW PRODUCTS



The ETA 10 installed at Florida State University has two CPUs (foreground) and a 
shared memory unit (background). 


ETA 10 supercomputers hit 
unprecedented speeds 

ETA Systems, a subsidiary of Control
Data, claims that its ETA 10 supercomputer
operates up to 30 times faster
than predecessor supercomputers, with
a peak performance rate of ten billion
calculations per second.

Each system contains one to eight
central processing units, each CPU having
both a scalar and a vector processor
and local memory of four million 64-bit
words.

Shared memory in the ETA 10 serves 
up to 8 CPUs and 18 I/O units. The 
shared memory is available in sizes 
from 32 million words to 256 million 
words. 

The communication buffer provides
memory common to all processors and
allows them to communicate and
coordinate activities.

The system uses a 64-bit word length, 
with 32-bit and 128-bit word lengths 
available for applications requiring 
half-word or double-precision modes. 

The initial user environment is based
on VSOS, or Virtual Storage Operating
System, used on the Cyber 205. System
V, based on AT&T’s Unix System V
operating system, is planned, as is a
NOS/VE interface for Control Data’s
machines.

Prices vary with configuration, ranging
from around $5.5 million for a
single-processor system to over $22 million
for an eight-processor system.

Reader Service 30


Expert building tool runs on 286, 386 PCs

Gold Hill Computers has announced
GoldWorks, an expert system building
tool for developing and delivering
knowledge-based expert systems on
Intel 80286- and 80386-based PCs.

According to the company, GoldWorks
combines a knowledge base; an
inference engine with forward, backward,
and goal-directed forward chaining;
a multilevel open architecture with
two programming and debugging
environments; external interfaces to Lotus
1-2-3, dBase, and C; a screen toolkit;
and on-line tutorials and a help system.

GoldWorks addresses up to 15M
bytes of memory on the IBM PC-AT,
14M bytes on the Compaq Deskpro
386, and up to 24M bytes on Gold
Hill’s 386 Hummingboard. It also runs
on IBM’s Personal System/2. Minimum
system requirements are 512K bytes of
base memory, 10M bytes of hard disk
memory, at least 5M bytes of extended
memory, and a CGA display adapter
and monitor.

GoldWorks will sell for $7500. Until
July 31, 1987 it sells at the introductory
price of $5000 for the IBM PC-AT and
Compaq Deskpro 386. Bundled with
the 386 Hummingboard, it will sell for
$13,300, but is available for $10,800
until July 31.

Special training courses are also
available.

Reader Service 31


CM-2 handles lots of data

Thinking Machines has added the
CM-2 to the Connection Machine line
of computers. According to the company,
the CM-2 is designed for applications
with large amounts of data.

The CM-2 reputedly achieves 2500
million floating-point operations per
second and 2500 million instructions
per second.

The 64,000-processor CM-2 supports
eight I/O channels operating at 40M
bytes per second per channel. Total
system memory is 512M bytes. Single or
double precision floating point is
offered.

Prices for the CM-2 range from one
to five million dollars.

Reader Service 32
80386-based micro for
under $2000


Advanced Logic Research claims to 
have released the first 80386-based 
microcomputer priced below $2000, the 
ALR 386/2. 

The model 10 costs $1990 and features
a 386 system board with 1M byte
of 32-bit RAM expandable to 2M bytes
on the system board, Phoenix BIOS, a
1.2M-byte floppy disk drive, one serial
port, one parallel port, and a 101-key
expanded keyboard.

Models 40, 80, and 130 all feature 2M 
bytes of 32-bit RAM, plus a hard-disk 
controller with track disk caching. 
Model 40 costs $3990; model 80, $4690; 
and model 130, $7299. 


Reader Service 33 



The ALR 386/2 is a second-generation 80386-based microcomputer from 
Advanced Logic Research for under $2000. 


Networking software manages files 


Ungermann-Bass offers a software 
package called Net/One Universal File 
Manager that reputedly provides 
universal file access and management 
between any combination of IBM and 
Apple Macintosh personal computers 
and IBM, Digital Equipment, and 
Hewlett-Packard mainframes and 
minicomputers. 

The software consists of a core product
called the Universal File Transfer
and two options, the Universal File
Transformer and the Universal File
Administrator.

Universal File Transfer software
resides on hosts and PCs. The Universal
File Transformer resides on the host,
reformatting and translating data into
host and PC formats. The Universal
File Administrator resides on the host,
controlling access to data.

Prices for the Universal File Transfer
begin around $1500 for minicomputers
and $7000 for mainframes. Prices for
the Universal File Transformer begin
around $600 for minicomputers and
$3000 for mainframes. Prices for the
Universal File Administrator begin
around $1000 for minicomputers and
$4500 for mainframes.

Universal File Manager software for
PCs costs $225.

Reader Service 34


Three new software packages out from DEC

Digital Equipment offers three new
software packages: VAX Software Project
Manager, Data Transfer Facility,
and VAX Document.

VAX Software Project Manager software
is a tool for planning, estimating,
and managing medium-to-large software
development projects. It runs on
current VAX processors. A graphics
terminal or VAXstation is required for
graphical functions. The software is
licensed from $2250 to $71,250. Delivery
is scheduled for August 1987.

Data Transfer Facility is layered software
that allows users to move information
and files between a DEC VAX-based
system and an SNA environment. It
requires either VMS/SNA software or
the DECnet/SNA Gateway to link the
DEC and SNA environments. Server
software ranges in price from $2100 to
$21,000 depending on hardware
configuration. Client software ranges from
$900 to $9000. DTF software residing
on an MVS host system costs $20,000.

VAX Document adds technical
documentation capabilities to existing
VAX systems. It provides tools for text
creation, a standard markup language,
text and graphics integration, revision
control, and document formatting.
Prices range from $1350 to $32,400
according to hardware configuration.

Project manager: Reader Service 35
DTF: Reader Service 36
Document: Reader Service 37


CAE automates ASIC design

The CAE Systems Division of Tektronix
has announced two software
products to aid in the design of
application-specific integrated circuits.

The Gate Array WorkSystem reputedly
provides design automation software
for the creation of circuit designs
on specific gate arrays. With
performance-driven layout, the designer
specifies circuit design parameters on the
schematic, simulates the design, and
runs the Merlyn-G physical layout system
in conjunction with Tektronix’s
TurnChip module for the specific array
to achieve the optimal layout.

The TurnChip software modules provide
knowledge-based, automatic control
of the Merlyn-G gate array layout
system. The modules contain specific
layout information for each gate array
family. TurnChip modules include
arrays in the AMCC, NEC, and Fujitsu
gate array families.

The WorkSystem includes the following
design software: Tektronix’s
Designer’s Database Schematic Capture
software, the HILO-3 Logic Simulation
System, and the Merlyn-G automated
physical layout software for gate arrays.

The WorkSystem software runs on
the DEC VAX-based computer family
and the Apollo Domain family of
workstations.

The Gate Array WorkSystem software
costs $70,000.

The Tektronix/AMCC Q3500S TurnChip,
developed with Applied Micro
Circuits, costs $3500.

WorkSystem: Reader Service 38
TurnChip: Reader Service 39
Pyramid expands Series 9000, adds Cobol and Lisp


Pyramid Technology has announced 
the addition of two new computers to 
the upper end of its Series 9000 line. 

The 9830 has three processors and the 
9840 has four. They serve up to 512 
users and support up to 32G bytes of 
mass storage and 128M bytes of main 
memory. 

Standard configurations of the 9830 
and 9840 systems cost $424,000 and 
$514,000 respectively. This includes 
32M bytes of memory, 32 RS-232 ports, 
a 470M-byte disk drive, one-half-inch 
streaming tape drive, system console, 
Ethernet, and an OSx license. 

An entry-level model, the 9805, provides
a single-processor system in the
Series 9000. It comes with 4M bytes of
memory, 470M bytes of mass storage,
and 16 RS-232 ports. It costs $128,950
and is field upgradable, as are other
members of the Series 9000.

Pyramid now offers Cobol 85, which
adheres to the standards for Cobol set
by the American National Standards
Institute. The Cobol compiler will be
available across Pyramid’s system product
line in the third quarter of 1987 for
$11,000 for the 98x/xe and Series 9000
systems, and $8000 for the 90x and
WorkCenter departmental systems.

Pyrlisp, based on Common Lisp,
runs across Pyramid’s superminicomputer
product line. It consists of an
interpreter, compiler, and debugger.
The language is available for $6000 on
Pyramid’s superminicomputers.
Upgrades to Pyrlisp are available for
$3000.

Pyramid now offers Pyrnet
SNA/3270, licensed from Systems
Strategies. The software enables
Pyramid computers to emulate a variety
of IBM devices. It runs on Pyramid’s
intelligent synchronous communications
controller and uses IBM’s synchronous
data link control communications
protocol. The software costs
$4800, or $9695 for a package including
the software and the ISC controller.

Series 9000: Reader Service 40 
Languages: Reader Service 41 
Pyrnet: Reader Service 42 


Software emulates 3270 

Attachmate has announced 3270
emulation micro-to-mainframe software
products that work with IBM’s
Personal System/2 and IBM PCs and
compatibles.

Extra! features application program
interfaces (APIs) and universal connectivity
options, including support of
local area networks, coaxial cable, and
remote mainframes through modems
(planned). It allows up to four mainframe
sessions at one time, windowing,
printer emulation, a PC-DOS session,
and support for IBM’s 3270 PC API, in
addition to basic 3270 emulation.

Extra! Entry-level reputedly provides
essential 3270 terminal functions to the
IBM PC and PS/2. Features include file
transfer, screen print, and program
interfaces. It connects to the mainframe
through coaxial cable or a LAN.

Extra! costs $425 and Extra! Entry-level
costs $275. Upgrades are planned
for October 1987.

Extra!: Reader Service 43 
Entry-level: Reader Service 44 


Zenith portable PC features hard disk drive 


Zenith Data Systems has announced
the Z-183, a portable personal computer
with a 10M-byte hard disk drive.

According to the company, the
Z-183’s shock-mounted hard disk drive
offers between two and five hours of
battery operation, depending on disk
access and battery size.

The Z-183 includes an 80C88 16-bit
processor operating at 8 MHz, switchable
to 4.77 MHz. The system comes
standard with 640K bytes of RAM and
16K bytes of video RAM. RAM can be
expanded to 1.64M bytes using an
optional add-on board (available later
this year).

Other features include serial and parallel
ports, RGB and composite monochrome
video interface, battery pack,
110/220 VAC power adapter/charger,
socket for Intel 8087 math coprocessor,
real-time clock and calendar, and
MS-DOS 3.2.

Suggested price for the Z-183 is
$3499.

Zenith Data Systems has also
announced a computer monitor that the
company claims has a completely flat
screen with the brightest color computer
display available and almost no reflection.
The monitor is compatible with
the VGA graphics output of IBM’s
Personal System/2 computers.

The 14-inch Flat Technology Monitor,
Model ZCM-1490, has a suggested
price of $999.

Portable: Reader Service 45
Monitor: Reader Service 46


Zenith Data Systems’ ZCM-1490 flat-screen color monitor has almost no glare.
Software manages projects

Project Software & Development
(PSDI) has announced Qwiknet Professional,
a project management software
program for the IBM PC-AT, PC-XT,
and compatibles. The company calls the
product a management tool for professional
project managers and sophisticated
users who manage complex
projects.

Qwiknet employs critical path
scheduling, time-limited resource
scheduling, and priority scheduling.

The program is multiple project and
multiple calendar, measuring actual
performance in terms of schedules,
expenditures, and resource usage. It
produces project management reports
such as bar charts, histograms, and
logic diagrams.

A proprietary window environment
can be manipulated with a mouse, or
activated using preprogrammed function
keys.

Qwiknet Professional costs $1495.
The PSDI mouse is optional, for $125.

Reader Service 47


Oracle software
extends PC limits

Oracle has announced three software
products that reputedly overcome the
memory limitations of the DOS operating
system.

Professional Oracle, according to the
company, places the database management
system kernel in protected mode
above the DOS-imposed 640K-byte barrier,
leaving over 500K bytes of real
memory available for application code
and coresident utilities. It costs $1295.

LANserver Oracle is a dedicated,
distributed multiuser DBMS server for
286/386 PC LANs. The product kernel
and data both reside in one PC, accessible
by applications and tools running
elsewhere on the LAN. Available in the
fourth quarter of 1987, it will cost
$2495.

Networkstation Oracle allows PC
users to access and update a remote
database. It costs $695.

Professional Oracle and Networkstation
Oracle include SQL*Forms,
SQL*Calc, SQL*Plus, and
SQL*Report.

Professional: Reader Service 48
LANserver: Reader Service 49
Networkstation: Reader Service 50


Ashton-Tate upgrades MultiMate

Ashton-Tate has announced MultiMate
Advantage II, an upgraded version
of the company’s MultiMate word
processing software.

New features include the option of
document or page orientation; an
optional, pull-down menu interface; a
merge with dBase files without leaving
the word processor; a continual
undo/delete function to retrieve deleted text;
and laser printer support that allows up
to 26 fonts within a document and up to
18 soft fonts.

Other features include menu bypass,
six-function math, autohyphenation,
sorting within a document, single-key
execution, and an FFT-DCA conversion
feature for FFT-DCA-formatted files.

MultiMate Advantage II runs on the
IBM PC, PC-XT, PC-AT, and compatibles
with a minimum of 384K bytes of
free memory for DOS 2.0 or higher,
two double-sided disk drives, or a hard
disk and one double-sided disk drive.

MultiMate Advantage II costs $565
for the 5¼-inch formatted disk version,
or $595 for a combined package that
also includes a 3½-inch formatted disk.

Reader Service 51


Apollo claims network independence 


Apollo Computer has announced that
its Domain workstation computing
environment is now network independent
and can run directly on industry
standard Ethernet networks.

The Ethernet connection is achieved
through Apollo’s 802.3 Network
Controller-AT, a controller board compatible
with the Ethernet/IEEE 802.3
network standard.

Domain Series 3000 Workstation
users have the option of employing the
Ethernet network interface or the company’s
original controller for Apollo’s
token ring network.

The price of a Domain Series 3000 
Workstation or DSP3000 server 
includes either a single 802.3 Network 
Controller-AT or an Apollo Token 
Ring Network Controller-AT. Both can 
be purchased separately as an add-on 
for $2000 each. 

Reader Service 52 


Supercomputers employ VLIW architecture 


Multiflow Computer has introduced
the Trace family of supercomputers for
compute-intensive applications. According
to the company, Trace systems are
single-processor machines based on a
very long instruction word (VLIW)
architecture and Trace Scheduling
compacting compilers.

The 64-bit systems run an enhanced
version of Berkeley 4.3 Unix and reputedly
incorporate industry standards to
ensure compatibility with other systems.

The series includes three models ranging
in price from $300,000 to $1 million.
The systems can be used alone or
as compute servers in networks.

The Trace 7/200 features seven oper¬ 
ations in one 256-bit instruction word. 
The Trace 14/200 features 14 opera¬ 
tions in a 512-bit instruction word. The 
Trace 28/200 features 28 operations in 
one 1024-bit instruction word. The 
14/200 and 28/200 will be available in 
the fourth quarter of 1987. 
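
To make the idea of packing several operations into one long instruction word concrete, here is a minimal sketch in Python. It is not Multiflow's Trace Scheduling algorithm, which the announcement does not describe; it is a simple greedy packer, and the function name `pack_operations`, the dependency table, and the operation names are all hypothetical. The `width` argument stands in for the seven, 14, or 28 operation slots quoted above.

```python
def pack_operations(deps, width):
    """Greedily pack operations into instruction words of `width` slots.

    `deps` maps each operation name to the set of operations whose results
    it needs; an operation may only be placed in a word that comes after
    the words holding all of its dependencies. (Illustrative only.)
    """
    remaining = dict(deps)
    issued = set()
    words = []
    while remaining:
        # Operations whose dependencies were all issued in earlier words.
        ready = sorted(op for op, needs in remaining.items() if needs <= issued)
        if not ready:
            raise ValueError("dependency cycle among operations")
        word = ready[:width]          # fill at most `width` slots
        words.append(word)
        issued.update(word)
        for op in word:
            del remaining[op]
    return words


# Hypothetical five-operation fragment: c needs a and b, d needs c.
SAMPLE = {"a": set(), "b": set(), "c": {"a", "b"}, "d": {"c"}, "e": set()}

wide = pack_operations(SAMPLE, 7)     # seven slots per word
narrow = pack_operations(SAMPLE, 1)   # one operation per word
```

With seven slots the sample needs three words, because the chain from a and b through c to d limits it rather than the slot count; with one slot it needs five.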

The entry-level Trace system costs
$299,500 and includes the Trace 7/200
CPU, 16M bytes of main memory
expandable to 512M bytes, VME I/O
processor, disk controller, 515M-byte
Winchester drive, cartridge tape unit,
video console terminal, Trace Scheduling
Fortran compiler, and Trace/Unix
operating system.

Reader Service 53 



The Multiflow Trace 7/200 is field- 
upgradable to the more powerful Trace 
supercomputer systems. 
Ada compiler runs on Deskpro 386

Alsys has announced an Ada compiler
that reputedly exploits protected
mode on the Compaq Deskpro 386.

The Alsys 386 Ada Compiler runs
under DOS 3.1 and supports the use of
protected mode, with direct access of up
to 16M bytes of main memory.

The compiler is derived from Alsys
PC AT Ada, which supports protected
mode on the 80286.

No price given.

Reader Service 54


80386-based supermicro
runs Unix and DOS

Prime Computer has announced the
Prime EXL 316 supermicrocomputer,
which reputedly runs Unix and MS-DOS
programs simultaneously.

Based on the Intel 80386 microprocessor,
the EXL 316 system runs Prime’s
implementation of AT&T’s Unix System
V.3 operating system and a software
package called Merge 386 that
allows users to run character-mode
MS-DOS and Unix applications at the same
time.

According to the company, the EXL
316 supermicro has been measured at
3.31 million single-precision Whetstones
and 2.07 million double-precision
Whetstones using an extended math
coprocessor and an optimized C
compiler.

The Prime EXL 316 system communicates
with Prime 50 Series superminicomputers
and other systems
through IEEE 802.3 standard Ethernet
running Transmission Control
Protocol/Internet Protocol (TCP/IP)
software.

The EXL 316 comes in a single cabinet
with one CPU, three Multibus II
expansion card slots, and space for a
tape drive and two disk drives. A
dual-cabinet version provides seven expansion
Multibus II card slots and space
for four disk drives.

A standard configuration costs
$23,900 and includes the operating system,
2M bytes of memory, a 90M-byte
formatted disk, a 60M-byte streaming
tape backup subsystem, and 10
asynchronous lines.


Modular Design Environment aids ASIC design

LSI Logic calls its integrated system
of software tools for the design of
application-specific integrated circuits
the Modular Design Environment. The
three major elements are the Logic
Integrator, the Silicon Integrator, and
the System Integrator.

The entry-level Logic Integrator targets
the design, development, and analysis
of single-chip ASICs requiring up
to 6700 usable gates. The software system
includes macrocell and macrofunction
libraries. It currently runs on the
Sun 3 Series of workstations under
Unix, with MicroVAX and Apollo versions
planned. Logic Integrator is
offered on an annual license basis, with
updates and improvements included in
the maintenance license.

The Silicon Integrator supports both
cell-based and array technologies for
complex ASIC designs. A CAE system,
it includes schematic capture, logic
simulation, timing verifiers, logic
synthesizers, silicon compilers, floorplanning,
fault grading, test program
generators, and bonding pad modules.
It runs on a variety of workstations and
mainframe platforms under Unix,
VMS, and VM/CMS operating systems.
Silicon Integrator is offered on an
annual license basis.

The System Integrator combines
system-level modeling and simulation
software with process-driven ASIC
development tools. Composed of
CAE/CAD building blocks, it targets
complex multichip systems of more
than 100,000 gates. It includes BSIM, a
system-modeling behavioral simulator;
MSIM, a gate-level multichip simulator;
and Accel Models, a support library of
industry-standard microprocessors,
peripheral circuits, and memories, plus
LSI Logic megacells, standard parts,
and macrofunction models. Behavioral
Simulation Language supports the definition
of behavioral models, while Gate
Modeling Language supports the definition
of gate models.


Integrated software runs on IBM PCs

Migent has announced Ability Plus,
an integrated software package for the
IBM PC and compatibles. The core
applications include a relational database,
a spreadsheet, word processing,
graphics, and communications.

According to the company, notable
features include relational links,
dynamic linking, seamless integration,
an extended field metaphor, and the
library screen navigation system.

Relational links allow an extended
field to serve as a joining field to a database.
Dynamic linking allows changes
in one application to reflect in another
application that references the same
data. Seamless integration allows applications
to be displayed in common
rather than segregated environments.
The extended field metaphor permits
the creation of a field in a document in
much the same way as in a database.
The library screen navigation system
allows users to shift between applications
while remaining within Ability
Plus.

Other features include a spell
checker, macros, EGA support, and a
printer driver creation program.

Ability Plus costs $199.


Project managing software for IBM PCs

Westminster Software has announced
Pertmaster Advance, a project management
software product that runs on
IBM personal computers (PC, XT,
AT), true compatibles, and IBM Personal
System/2 models with 320K bytes
or more of RAM and two disk drives.

According to the company, the software
is designed to plan, schedule, and
manage projects with multiple interrelated
tasks to be accomplished within
given time and budgetary constraints.

Output on a printer or plotter
includes logic diagrams from critical
path analysis and Gantt charts from
time-related analysis. Users may filter
data to show specific project elements.

Pertmaster Advance costs $1495.
Database employs optics


Image Management Systems has 
announced Laserbase I, an optical data 
processing system that according to the 
company integrates word processing 
documents, handwritten notes, engi¬ 
neering drawings, or microfilm into 
electronic file folders independent of 
data source or format. 

IMS claims that the product’s open 
network architecture allows it to be 
used as the database in a multivendor 
environment. Laserbase I’s native network
is Ethernet.

Access is by means of a mouse and 
window format. The software stores 
acquired images on optical disk, with 
the number of electronic file folders 
unlimited. 

Laserbase I posts all system access 
attempts, including insertions and edit¬ 
ing of data. 

Laserbase I uses optical disk in write- 
once-read-many (WORM) format. 
Every revision is permanently archived. 

The product runs with Sun 
Microsystems’ operating system and 
Unix, and is compatible with most envi¬ 
ronments, according to the company. 
Price depends on the configuration of 
the hardware used, starting at under 
$100,000. 
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Multitasking kernel 
targets IBM PCs 

Andyne Computing has announced 
PCMascot, a real-time multitasking 
kernel for the IBM PC. According to 
the company, the software allows the 
real-time system developer to produce 
application systems that use concur¬ 
rently executing programs to solve con¬ 
trol or data acquisition problems. 

According to the company, the pro¬ 
grammer has access to the facilities of 
PC-DOS as well as shared memory seg¬ 
ments for intertask communications, 
synchronization and mutual exclusion 
primitives, and real-time execution of 
application programs. 

Applications can be coded in C or 
assembler and written using different 
compilers or program models. When 
buying the program, the user must 
specify the C compilers to be used. 

PCMascot costs $795, plus $50 for 
the manual and a demonstration 
program. 
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Software helps system recognize poor-quality characters 


Cognex offers a Character Recognition 
Software Toolkit for the Cognex 2000 
single-board machine vision system. 

The software is based on the Cognex 
DataMan 1000 Character Recognition 
system. 

The vision system with the new soft¬ 
ware reportedly recognizes characters 
that are stamped, embossed, etched, 
engraved, or printed directly on a part. 
It reads characters from metal, plastic, 
ceramic, glass, and black rubber surfaces. 

The software is user trainable, 
according to the company. Showing the 
system examples of the characters used 
in-plant teaches the system to read the 
type styles or fonts, whether alphanu- 
merics or symbols or part outlines. 

The system also recognizes shaded 


Workstation debugs VLSI 

Sentry Schlumberger’s Advanced 
Products Group has announced the 
Integrated Diagnostic System 5000 (IDS 
5000) for VLSI diagnostics and debugging. 

According to the company, the IDS 
5000 combines scanning electron micro¬ 
scope (SEM) technology and CAD/CAE 
tools in a workstation environment. 

This reputedly allows logic designers to 
observe behavior of the actual circuit 
while monitoring circuit connectivity 
and predicted behavior through the 
schematic netlist, layout database, and 
simulation tools. 

Four diagnostic tools are displayed as 
windows on a color-graphics monitor. 
The netlist tool permits browsing of a 
netlist text file. The layout tool displays 
the chip CAD database. The SEM tool 
acquires, averages, and displays voltage 
contrast images. The scope tool permits 
waveform acquisition through auto¬ 


characters, tolerates image degradation
and contrast fluctuation within the image,
and ignores extraneous information
adjacent to the characters of interest.
According to the company, the latter 
ability results from the system’s use of 
grey-level normalized correlation as the 
method to locate characters. 
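
The announcement does not say how Cognex implements this step. As a rough
illustration only, the textbook form of normalized correlation subtracts each
image's mean grey level and divides by the magnitudes of the remaining
deviations, so a uniform change in brightness or contrast leaves the score
unchanged. The sketch below uses hypothetical function names and a brute-force
search, both assumptions rather than the vendor's method, to slide a small
template over an image and report the best match.

```python
import math


def normalized_correlation(patch, template):
    """Correlation coefficient in [-1, 1] of two equal-size grey-level images."""
    a = [p for row in patch for p in row]
    b = [t for row in template for t in row]
    if len(a) != len(b):
        raise ValueError("patch and template must be the same size")
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    # Removing the means cancels brightness offsets; the division below
    # cancels contrast (gain) changes.
    da = [x - mean_a for x in a]
    db = [x - mean_b for x in b]
    denom = math.sqrt(sum(x * x for x in da) * sum(x * x for x in db))
    if denom == 0:
        return 0.0  # convention chosen here: a flat window carries no pattern
    return sum(x * y for x, y in zip(da, db)) / denom


def locate(image, template):
    """Slide the template over the image; return (score, row, col) of the best window."""
    th, tw = len(template), len(template[0])
    best = (-2.0, 0, 0)
    for r in range(len(image) - th + 1):
        for c in range(len(image[0]) - tw + 1):
            patch = [row[c:c + tw] for row in image[r:r + th]]
            score = normalized_correlation(patch, template)
            if score > best[0]:
                best = (score, r, c)
    return best
```

A score near 1 means the window matches the template up to a change in
brightness and contrast, which is why a dim or washed-out character can still
be found.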

The software generally would be pur¬ 
chased by original equipment manufac¬ 
turers as part of a development system, 
or by individual companies for specific 
applications. 

Prices vary according to quantity 
purchased and configuration of the tar¬ 
get system. Contact the company for 
information on pricing and availability. 
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The IDS 5000, shown here with the test 
chamber open, features an inverted 
column for direct electrical access to the 
device under test. 

mated control of the electron-beam 
pulsing circuits. 

The US list price of the IDS 5000 is 
$495,000. 
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Industrial computer is AT compatible 


Xycom has announced the 4150 
Industrial Computer System. According 
to the company, the 4150 is fully com¬ 
patible with the IBM PC-AT. 

The 4150 includes a five-slot AT pas¬ 
sive backplane, an EGA/CGA color 
monitor, data entry and function key¬ 
pads, hard and floppy disk facilities, 
and expansion capabilities. 

The front panel is sealed to NEMA 
4/NEMA 12 standards, and the CRT is 
protected by an impact-resistant Lexan


shield. The cardcage, power supply, 
and disk facilities are located in a pull¬ 
out drawer. 

The unit can be ordered with or with¬ 
out PC boards, with or without disk 
drives, with or without a full-size sealed 
keyboard, and with a specific processor 
(8088, 80286, or 80386). 

The standard 4150 without options 
costs $3600. 
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Tandem has new computers, 
laser printer 

Tandem Computers has announced 
two new distributed computer systems 
and a desktop laser printer. 

The NonStop CLX systems, based on 
the Tandem Guardian operating sys¬ 
tem, target on-line transaction process¬ 
ing networks. Single quantity pricing 
starts at $57,000. 

The 32-bit LXN multiuser system is 
Unix-based and expandable from one to 
three processors. Single quantity pricing 
starts at $23,700. 

The Laser-LX printer works with 
Tandem distributed systems and work¬ 
stations. It prints eight pages per minute 
and is compatible with Hewlett-Packard’s 
LaserJet Plus printer. Available in the 
third quarter of 1987, it will cost $2595. 

According to the company, CLX and 
LXN systems will extend networks of 
larger Tandem NonStop systems. 
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Minis use multi 80386s 


Commercial Systems (CSI) has 
announced the HS Series of minicom¬ 
puters based on Intel’s 80386 CPU chip. 

All four planned models in the HS Series
run Unix V.3, which has been
ported to the 80386 chip running in a
multiuser environment. According to
the company, all models in the series are
compatible.

The HS-1000 with one CPU supports 
up to 16 users. Fully configured, it con¬ 
tains two CPUs, supports 32 users, and 
provides 32M bytes of RAM, 1520M 
bytes of internal hard disk storage, 
streaming tape backup, and support for 
up to 16 peripheral devices. It reputedly 
achieves four to five million instruc¬ 
tions per second. Prices start at 
$25,500. 

The HS-2000 uses up to eight 80386
processors. It supports up to 128 users.
It reputedly achieves over 32 MIPS.
Prices range from $34,000 to $250,000.
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Commercial Systems’ HS-1000, shown 
here, reaches four to five MIPS and 
supports up to 16 users with one CPU 
or 32 users with two CPUs. It is based 
on Intel’s 80386 chip. 


ANNUAL IEEE DESIGN AUTOMATION WORKSHOP 


Call for Participation 

Sponsored by 

the IEEE Design Automation 
Technical Committee 

Issues in Design 
Representation: 
Languages, Models, and 
Data Structures 

Gold Canyon Resort, 

Apache Junction, Arizona 
January 13-15, 1988


Purpose of the Workshop: 

The focus of the 1988 Design Automation Workshop will be Issues in Design Representation with emphasis on 
the strengths and weaknesses of various representational systems, and how they influence the choice and 
implementation of Design Automation tools. Presentations are sought with regard to all levels of representation 
including: Specification, Behavior, Register, Logic, Circuit, and Geometric. 

Issues of Interest Include: 

Influence of Applications: 

Specification, Synthesis, Simulation, Verification, Test
Textual vs Graphical Representation 
Hierarchical vs Unified Systems 
Readability and Understandability 
Influences of Machine Architecture 
Object-Oriented vs Traditional Data Bases 
HDL Design Libraries and Intermediate Forms

Participation in the Workshop:

Attendance at the workshop is limited to 55 persons. To participate, please submit a short summary of your
interest and activities. Sessions will be arranged to include short presentations of positions or speculations, as 
well as longer presentations to report activities and results. If you would like to make a presentation at the 
workshop, also submit a brief summary describing your proposed talk. Send this information to the workshop 
chairman, Walling Cyre. 

If you have suggestions for session themes or would like to organize a session, contact the program chairman, 
Barry Winston. 


THE COMPUTER SOCIETY 

OF THE IEEE

1730 Massachusetts Avenue, N.W. 
Washington, DC 20036-1903 


Workshop Location: 

The Gold Canyon Resort is located in the scenic Superstition Mountains, 25 miles east of Phoenix, Arizona. The
resort thus provides a relatively secluded setting for the workshop, yet is convenient to a major airport.
Recreational activities include horseback riding, hiking, swimming, golf, and tennis. The elevation of the resort 
is 1715 feet. The climate is very sunny and warm, and the humidity is very low. 

More Information: 


If you wish more information about the workshop, contact the workshop chairman. Also, as the time for the
workshop draws near, you will receive a detailed program for the workshop as well as information on
registration at the Gold Canyon Resort.

Workshop Chairman:

Dr. Walling Cyre
Corporate Research and Engineering
Control Data Corporation, HQM173
Box 1249
Minneapolis, MN 55440
(612) 853-2692

Program Chairman:

Barry Winston
Large Systems Group
UNISYS, M.S. WEI D
Box 64942
St. Paul, MN 55164-0942
(612) 635-2333
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Microsystem Announcements 


Company, Model, Function; Comments; R.S. No.


Boca Research
EGA by Boca
EGA board

Provides the capabilities of the IBM Enhanced Graphics Adapter Card, plus backward
compatibility to the Hercules Graphics Card, Color Graphics Adapter, and Monochrome
Display Adapter. Uses Chips and Technologies’ chipset. Software includes screen-saving
functions, mode change, diagnostic routines, and so forth. Cost: $199.
R.S. No. 136

Fastcomm Data
Fastcomm Turbo
Modem

A modem that adds 19.2 kbps capability to a Fastcomm 2496 or 9600 modem. Designed
for unidirectional large-file transfer applications. Available in two models: a 9600 Turbo
supporting 19200, 9600, 7200, 4800 data rates, or a 2496 Turbo with 19200 through 4800
bps and Hayes-compatible 2400, 1200, 300 capability. Cost: Ranges from $999 to $1099.
Upgrades of 2496 or 9600 modems for $100.
R.S. No. 137

Fujitsu America
M3727MA
Laser printer

Operates at 18 pages per minute at 300 dots-per-inch resolution. Designed to print 25,000
sheets per month. Has two 250-sheet cassettes for automatic feeding, with a 1000-sheet
capacity hopper as an option. Accommodates legal, letter, A4, B4, and B5 paper sizes.
Emulates Epson FX80, Diablo 630, and Hewlett-Packard LaserJet Plus. Cost: $7950.
R.S. No. 138

Heurikon
HK68/M120
Single-board computer

A single-board microcomputer based on the Motorola 68020. Features up to 4M bytes of
DRAM with parity, 128 bytes of RAM, up to 256K bytes of EPROM, two RS-232 serial
ports (RS-422 optional), ANSI-compatible SCSI interface, Centronics interface, single
16-bit iSBX connector, three 16-bit counter/timers, iLBX for memory expansion, mailbox
interrupt support, and master/slave interface to Multibus I. Cost: $1995 (100’s) with
1M-byte DRAM.
R.S. No. 139

IDEAssociates
IDEAtape
Tape drive

A 60M-byte tape drive compatible with PC systems running at speeds up to 16 MHz,
including the Compaq 386 and IBM’s Personal System/2 Model 30, PC, PC-XT, and
PC-AT. Available in external and internal configurations. Permits storage of multiple backups
on a single tape. Also allows different types of backups, such as image-based or
file-by-file, to reside on the same tape. Cost: $1775 (internal) or $2395 (external).
R.S. No. 140

Maximum Storage
APX-3000 Series
Optical drives

Targets IBM PC, PC-XT, PC-AT, and compatibles. Available in host-mountable and
external stand-alone configurations. Software compatibility with PC/MS-DOS versions 2.0
through 3.2. Features a spindle brake and a face-plate design that permits host-mounting
of a single drive model across the PC line. Attaches to the computer via a controller card.
Data transfer at 2.5M-bits/second at a disk rotational speed of 1800 rpm. Cost: $2695.
R.S. No. 141

Plus Development
Hardcard 40
Hard-disk expansion board

A 40M-byte hard-disk expansion board for IBM PCs, including Personal System/2 Model
30, and compatibles. Occupies one expansion slot. Uses two 3.5-inch thin-film disks with
four minicomposite heads to provide 42.26M bytes of formatted storage capacity. Features
a rotary voice coil actuator and a mean-time-between-failure rate of 40,000 hours. Cost:
$1195.
R.S. No. 142

Qualstar
Ministreamer
Tape drive

A 9-track, half-inch reel-to-reel tape drive with SCSI controller and Macintosh SCSI tape
utility software for Apple Macintosh systems, including Macintosh SE and Macintosh II.
Drives Xerox 9700 laser printers. IBM/ANSI compatible. Requires Monaco 9- and
12-point fonts to support terminal output. Cost: $3995; $500 for MacSCSI tape utility
software.
R.S. No. 143

RDK Instruments
Model RY 5200
Plotter

An eight-pen digital plotter for A, B, C, and D size drawings. Handles fiber, ink, ceramic,
or ball-point pens. Features an automatic capping function and manual or software-
controlled pen changes. Pen speeds of 20 IPS and resolution of 0.001 inches. Standard
128K-byte buffer memory with 256K-byte auxiliary memory available. Interface via an
RS-232-C port (HP-IB optional). HP-GL-compatible language. Cost: around $6000.
R.S. No. 144

Ready Systems
VRTX32
Multitasking kernel

A real-time, multitasking kernel for use as a standard software component. Designed for
32-bit microprocessors, including the Motorola 68020 and Intel 80386. Also available for
the Motorola 68000 and 68010. Upwardly compatible with Ready Systems VRTX. Cost:
$6775.
R.S. No. 145

Wyse Technology
WY-995
Interface board

An eight-channel serial interface board that allows up to eight terminals, modems, printers,
and other serial devices to be interfaced with IBM PC-AT-compatible systems. Uses standard
RJ-11 connectors. Supports continuous bidirectional data transfer at a rate of 38.4K
baud. Features a Z80B processor running at 4.9 MHz and 48K bytes of dual-port RAM.
Cost: $699.
R.S. No. 146
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IC Announcements


Company, Model, Function; Comments; R.S. No.


Analog Devices
AD9002
A/D converter

Fabricated in a bipolar process, this 8-bit analog-to-digital converter has a sampling rate of
15,000,000 samples per second, a typical power dissipation of 750 mW, and an input
capacitance of 17 pF. Its analog signal bandwidth at -3 dB is 115 MHz; at 1 MHz its
signal-to-noise ratio is 43 dB and its harmonic suppression is 63 dB. The maximum
differential nonlinearity of the converter is KLSB. Cost (100’s) for the %LSB grade is $90;
for the ^LSB grade, $135.
R.S. No. 120

Analog Devices
AD7590DI, AD7591DI, and AD7592DI
CMOS switches

These dielectrically isolated switches offer ±25-V overvoltage protection and on-chip data
latches. Consisting of four independent single-pole, single-throw analog switches and
packaged in a 16-pin DIP, the AD7590 and 7591 models also feature a maximum switch-on time
of 250 and 400 ns and switch-off time of 400 and 250 ns, respectively. Consisting of two
independent SPST switches and packaged in a 14-pin DIP, the AD7592 has a transition
time of 600 to 350 ns maximum. The cost in 100’s starts at $4.95.
R.S. No. 121


Gould Electronics
S614381
CMOS ALU

This 16-bit cascadable arithmetic/logic unit integrates four LS381-type 4-bit ALUs with an
LS182-type carry look-ahead generator and performs addition, subtraction, and logical
functions with status and carry signals for its cascadable functions of carry, propagate, and
generate. The chip has input and output registers with a transparent mode and an output
feedback path. Packaged in a 68-pin, J-lead plastic leaded chip carrier. Price (1000’s):
$19.95.
R.S. No. 122

Motorola
MCM6269
Static RAM

This 4K x 4 CMOS static RAM has an access time of 35 ns, operates under a single 5-V
supply, and features a low active ac power operation of 90 mA maximum and 40 mA
maximum under dc conditions. Conforming to the JEDEC standard pinout, it is available in a
20-lead plastic dual-in-line package. Price (100’s): $5.78.
R.S. No. 123


NCR
53C300
SCSI controller

This CMOS, single-chip, small computer system interface protocol controller features an
integrated dual-ported buffer controller and handles buffer sizes from 256 to 64K bytes. It
also features an on-chip address latch, supports arbitration and reselection, and
incorporates on-chip, high-current drivers (48 mA). It is supplied in an 84-pin PLCC package
and operates from a single 5-V supply. Cost (1000’s): $21.50.
R.S. No. 124

Precision Monolithics
PM-7548
CMOS DAC

This 12-bit multiplying CMOS digital-to-analog converter features a flexible digital
interface and accepts 8-bit data bytes in a left- or right-justified, user-selectable data format.
Specifications are ±1/2 LSB over temperature for integral and differential nonlinearity, gain
error of ±1 LSB, and worst-case zero scale error less than 0.03 LSB. Prices (100’s) range from
$7.58 for commercial-grade plastic (0 to +70°C) to $30.92 for the military grade (-55 to
+125°C).
R.S. No. 125


RAD Data Communications
RJ-001
LSI multiplexer

This 40-pin asynchronous multiplexer can multiplex and demultiplex twelve 38,400-bps
data channels for an effective operating rate of 2,457,600 bps. Features CDP, HDB3, or
AMI modulation schemes and such diagnostic capabilities as local loopback, remote
loopback, local sync indication, and remote sync indication. Price (100’s): $59.
R.S. No. 126


Silicon Systems
K222U
Modem

This 1200-bps modem has an integrated 16C450-compatible UART function and features two
control modes: dual-port and single-port. The dual-port mode allows interfacing with an
8-bit parallel bus for parallel data transfer. Operating from a single +5-V source, it draws
less than 75 mW, with a power-down mode that reduces idle power to 15 mW. Comes in
40-pin DIP packages. Price (100’s): $30.10.
R.S. No. 127


Space Research Technology
SRT 4501
Dual-port memory

This series of CMOS FIFO dual-port memory chips offers access times of 120 ns
(4501-120), 80 ns (4501-80), and 65 ns (4501-65). It features a basic configuration of 512 x
9 and uses ring counter pointers for internal read and write addressing. Prices start at $11
for the 4501-120, $15 for the 4501-80, and $19.50 for the 4501-65.
R.S. No. 128


Standard Microsystems
COM52C50
Interface controller

Conforming to the IBM 5250 standard used in the IBM System/36 and /38, this single-chip
TWINAX controller automatically implements handshake and frame format as well as
parity generation and error detection for communication over a distance of 5000 feet of
TWINAX cable. It features an 8-MHz clock, built-in diagnostics, and a node address
emulator. Price (100’s): $19.05 for the plastic DIP version.
R.S. No. 129
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Use the Reader Service Card to obtain the following material: 

• Membership information and application (RS #202) 

• Publications catalog (proceedings, tutorials, standards) (RS #201) 

• Periodicals subscription application/information for individuals 
(members, sister-society members, others) (RS #200) 

• Periodicals subscription application/information for organizations 
(libraries, companies, etc.) (RS #199) 

• List of awards and award nomination forms (RS #198) 

• Technical committee list and membership application (RS #197) 

• Directory of officers, board members, committee chairs, represen¬ 
tatives, staff, chapters, standards working groups, etc. (RS #196) 


Also see membership application in this magazine. 
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CALL FOR PAPERS 




Conference on Office Information Systems 


Sponsored by • ACM-SIGOIS • IEEE Computer Society TC-OA 
In Cooperation with IFIP W.G. 8.4 
Palo Alto, CA—March 23-25, 1988 


New ideas from artificial intelligence, database design, and user interfaces promise 
substantial changes in the ways that people work together and acquire knowledge. 
Moreover, human organizations may provide important insights for the design of parallel 
processing and distributed artificial intelligence systems. Original research by technical 
experts will discuss progress, limitations, and implications of these topics. COIS is a 
conference devoted to intelligent processing of information in organizations. 


Specific topics of interest include: 

• Information Systems 

• Object-Oriented and Intelligent 
Databases 

• Social Processes 

• Multimedia/Hypertext Systems 

• Planning Systems 

• Voice/Video/Graphics 


• Effects of Technology on Human 
Organizations 

• Distributed Artificial Intelligence 

• Computer-Supported Cooperative Work 

• Organizational Design 

• User Models 

• Interconnect 


Keynote Speaker: Terry Winograd 

Information for authors: Submissions (max. 3500 words), by September 21, may be made 
either on paper (5 copies) or on some standard electronic medium. 

Dr. Robert B. Allen, 2A-367, Bell Communications Research, 

Morristown, NJ 07960 

phone: 201-829-4315 email: rba@bellcore.com 
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BOOK REVIEWS 


Editor: Wiley McKinzie, School of Computer Science and Technology, Rochester Institute of Technology, Rochester, NY 14623; Compmail, w.mckinzie; CSnet, wrm@rit 


Software Tools and Techniques for Embedded Distributed Processing 


Herbert C. Conn, Jr., David L. Kellogg,
David J. Rodjak, Rodney M. Bond,
States L. Nelson, Scott L. Harman,
Sue A. Johnson, William D. Baker,
and Paul B. Dobbs (Noyes Publications,
Park Ridge, N.J., 1986, 684 pp., $72)

Conn and his fellow authors attempt 
to identify the hardware and software 
technology pertinent to the implementa¬ 
tion of tightly coupled distributed sys¬ 
tems; establish an integrated approach 
to the total life-cycle software develop¬ 
ment period (correlating existing and 
near-term software engineering method¬ 
ology, techniques, and tools to each 
life-cycle phase); and define the func¬ 
tional design requirements for far-term 
development of needed software engi¬ 
neering methodology, techniques, and 
tools. 

The book consists of three reports 
prepared in June 1983 by General 
Dynamics Corporation for the US Air 
Force Systems Command Rome Air 
Development Center. The three reports
are “Distributed Processing Tools 
Definition—Hardware and Software 
Technologies for Tightly-Coupled Dis¬ 
tributed Systems,” “Distributed 
Processing Tools Definition—Application 
of Software Engineering Technology,” 
and “Distributed Processing Tools 
Definition—An Integrated Software 
Engineering Environment for Dis¬ 
tributed Processing Software Devel¬ 
opment.” 

In the first report on hardware and 
software technologies, the Ada lan¬ 
guage is identified as being of primary 
importance to embedded distributed 
processing systems. The book concen¬ 
trates on Ada and its impact on the 
software development life cycle, in par¬ 
ticular the evolution of Ada Program¬ 
ming Support Environments (APSEs). 
Conn et al. strongly suggest that tools 
should be integrated with the environ¬ 


ment, that is, an integrated software 
support environment (ISSE), and that 
the software life-cycle support tools for 
embedded distributed processing sys¬ 
tems should view hardware and soft¬ 
ware from a total-systems perspective. 

There are, however, significant problems
with realizing these ambitious
goals. A specific example is the issue of
types in the environment database. 
Existing mechanisms for tool integra¬ 
tion in programming support environ¬ 
ments such as Unix rely on simple byte 
streams. This is similar to languages 
that do not have strong typing. The 
people working on the Common APSE 
Interface Set (CAIS) have debated this 
and other issues related to tool integra¬ 
tion extensively and have not arrived at 
a satisfactory conclusion. Many feel 
that we do not understand the implica¬ 
tions of these issues well enough to 
make definitive decisions at this time. 

In 1983 we were only beginning to face 
these issues; APSE discussions centered 
on the Stoneman requirements 
(“Requirements for Ada Programming 
Support Environments: Stoneman,” 
DoD, Feb. 1980). At that time there 
were no validated Ada compilers, and 
there were two divergent APSE devel¬ 
opment efforts: the Ada Integrated 
Environment and the Ada Language 
System. This book predates much of the 
useful work in the APSE area. 

The concept of object-oriented 
modularization is introduced as a 
system-level viewpoint of both hard¬ 
ware and software that provides a cen¬ 
tral theme for the analysis, design, and 
implementation of embedded dis¬ 
tributed processing systems. Object- 
oriented design was popularized in the 
Ada world by Grady Booch (Software
Engineering with Ada, Benjamin/Cummings,
Menlo Park, Calif., 1983). The key
feature of the object-
oriented methodology is that each 
object executes without needing to 


know the internal details of other 
objects. An object is simply defined as 
an entity that contains information in 
an organized manner. Access to objects 
is strictly controlled on a need-to-know 
basis; information developed on one 
level of design is shielded from use on 
another level of design. Ada supports 
object-oriented design primarily 
through packages. The package specifi¬ 
cation contains the interface to an 
object, but the package body imple¬ 
menting the manipulations on the 
object is not directly accessible. The 
authors suggest that object-oriented 
design requires language support. While 
such language support as Ada packages 
is desirable, a number of projects have 
successfully applied object-oriented 
design using an Ada/PDL and imple¬ 
mented the design in a traditional lan¬ 
guage, for example, Fortran. As a 
general rule, design methodologies are 
language independent. 
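
The separation between a visible interface and a hidden implementation can be
sketched in almost any language, which is the reviewer's point. The following
is a minimal illustration in Python, not an example from the book; the function
name make_stack and the choice of a stack are hypothetical. The returned
operations stand in for an Ada package specification, and the enclosed list
stands in for the package body that clients cannot reach directly.

```python
def make_stack():
    """Build a stack object; clients see only the returned operations."""
    items = []  # hidden representation, playing the role of the package body

    def push(value):
        items.append(value)

    def pop():
        if not items:
            raise IndexError("pop from empty stack")
        return items.pop()

    def depth():
        return len(items)

    # This mapping plays the role of the package specification: it is the
    # complete interface, and it exposes nothing about how items are stored.
    return {"push": push, "pop": pop, "depth": depth}
```

A client that calls only push, pop, and depth keeps working if the list is
later replaced by some other representation, which is the need-to-know
discipline the review describes.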

The authors also suggest that the use 
of Ada in the distributed environment 
needs analysis, and discuss some global 
time issues related to distributing the 
Ada rendezvous. Since the topic of the 
book is building distributed systems, 
and Ada is the chosen implementation 
language, it seems this issue should have 
been studied in much more detail. Volz 
(“Some Problems in Distributing Real-
time Ada Programs Across Machines,”
Ada in Use, Proc. Ada Int’l Conf., 
ACM Ada Letts., Vol. 5, No. 2, Sept. 
1985, pp. 72-84) and Paulk (“Problems 
with Distributed Ada Programs,” Proc. 
Fifth Phoenix Conf. Computers and 
Commun., IEEE, New York, March 
1986) among others have identified a 
number of problems with distributing 
Ada programs. The solution chosen to 
solve a particular problem may have 
serious implications for the real-time 
performance of a distributed system. 

The second report on the application 
of software engineering technology dis¬ 
cusses the increased distribution of 
hardware, control, and databases; inte¬ 
grated support environments; and near- 
term generic tools and how they fit into 
the life cycle phases of software. I 
found this report comparatively superfi¬ 
cial and uninteresting. 


Recently published books and new periodicals may be submitted for review
to the editor at the address given above.

Note: Publications reviewed in this section are not available from the IEEE 
Computer Society; they must be ordered directly from the publisher. 
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Modula-2 Discipline & Design 


The third report on an integrated 
software support environment for 
developing distributed processing soft¬ 
ware provides a functional specification 
of an integrated software support envi¬ 
ronment (ISSE) and describes the effect 
of distributed processing on the tools in 
the ISSE. The functional specification 
of each tool contains a brief description 
of similar existing tools and suggested 
improvements. I found this discussion 
very interesting since it describes an 
ideal generic tool set that does not exist 
in any current programming environ¬ 
ment, but the discussion of existing 
tools and methodologies was disappointing.
There are probably fewer than two pages
on the Software Requirements Engineering
Methodology (SREM), even though SREM was
specifically developed under US Army contract
to address real-time, embedded,
and distributed types of systems. (SREM
has since evolved to DCDS—the Dis¬ 
tributed Computing Design System.) 

Conn et al. largely achieve their goals 
in the context of the state of the art in 
1983, although I would wish for a much 
more extensive discussion of existing 
tools. The discussion of tools and 
methodologies is at a theoretical and 
generic level and is of little use to a soft¬ 
ware practitioner. The discussion of 
tools with relation to distributed sys¬ 
tems was of value, and I plan to apply it 
to some of my own work in the area; 
distributed systems are a deferred topic 
in most existing APSE work such as the 
CAIS. The discussion of generic tools 
would interest a tool builder or an 
ISSE/APSE designer; but even though 
the book is clearly written and does not 
require great technical sophistication, it 
would be of limited interest outside this 
specific area. Graduate students might 
find it useful as a reference in a course 
on software engineering. 

I found the presentation of the mate¬ 
rial annoying because the book is type¬ 
set from the original three reports. The 
reason is to cut down on the time 
between research and general publica¬ 
tion, but in a fast-developing field such 
as APSEs, three years (four by the time 
this review appears) renders work 
obsolescent. If the book had been 
rewritten and updated, it could have 
been tightened significantly. If it had 
been retypeset, it could have been 
smaller and easier to read; some of the 
tables were practically illegible. I could 
not recommend spending $72 for the 
book when anyone interested in it can 
probably obtain the original reports 
from RADC. 

Mark C. Paulk 

System Development Corporation


Arthur Sale (Addison-Wesley, Read¬ 
ing, Mass., 1986, 452 pp., $25.95) 

Modula-2 is an outgrowth of Pascal
and Modula (Pasula?). Designed in the
early 1970s by Niklaus Wirth, Pascal
was created with the primary purpose of
introducing new programming techniques
such as top-down design and
structured programming to beginning
programmers while retaining a relatively
clean language design. As the
groundswell of enthusiasm for Pascal
migrated from the academic world to
the nonacademic world, it was quickly
realized that a teaching language is not
suitable for real-world programming.
Missing are such features as separate
compilation modules, conformant array
schemas, and the ability to call system
routines within a program. Much to the
joy of Pascal programmers everywhere,
compiler designers came to the rescue
and provided the missing features (and
so ended any hopes of a true standard
without a major redesign of Pascal).
Wirth, in an attempt to pick up the
shards of Pascal and to update the language,
wrote Modula-2, which does
encompass these changes and many
more. Also, some of the annoying features
of Pascal are omitted, such as
using BEGIN and END for compound
statements instead of, for example,
IF . . . END.

Sale’s edification of Modula-2 is
aimed at both camps, that is, students
and professional programmers. The
first 14 chapters concentrate on
Modula-2 as a teaching language. This
is the major strength of the book (and
the major portion). The concept of
Dijkstra’s predicate transforms, for
example, is introduced early in the book
and is used extensively. Parnas’s information
hiding is also discussed. These,
along with other modern software engineering
techniques, such as in-line
transforms, are applied to a relatively
long Modula-2 program that checks a
user-supplied personal identification
number against an internally stored
value. Sale clearly realizes the pedagogical
value of a large program that is discussed
using the principle of top-down
design.

The last sections of the book are 
devoted to Modula-2 as a real-world 
programming language. The concepts 
of modules and procedural types are 
introduced and illustrated using two 
long examples. One program is a synonym
token checker and the other is a
file comparison program. 


Another feature readers might find
interesting is the meta-notation used to
describe Modula-2. Instead of the traditional
EBNF notation, a two-dimensional
EBNF (2DEBNF) notation
is utilized. This meta-notation, credited
to Professor G. Rose of the University
of Queensland, does represent statement
syntax better than EBNF.

The book does have some weaknesses
in the areas where many beginning programming
texts are generally weak. One
brief page is devoted to recursion,
which is not enough by any stretch of
the imagination. Also, even though a
chapter is devoted to pointers, it is not a 
particularly thorough discussion of the 
variety of data types represented by 
pointers (as pointed out by the author). 
Another minor failing in the book is its 
poor index, which is a real liability to a 
researcher. 

Robert D. Haskins 

Jet Propulsion Laboratory 


Introduction 
to Simulation and 
SLAM II (3rd ed.) 

Alan B. Pritsker (John Wiley & Sons,
New York, 1986, 839 pp., $34.50)

I found this book to be both enjoyable
to read and useful, even though I
had never used SLAM or SLAM II. I 
have, however, implemented models in 
programming languages and simulation 
languages similar to SLAM II, and I 
feel that readers with a background 
similar to mine will also use and enjoy 
this book. 

Practitioners of SLAM and SLAM II
and other readers will find a thorough
description of SLAM II, including features
implemented in Version 3.1, in
Chapters 5 through 14. Moreover,
Chapter 17 describes capabilities that
augment and support SLAM II in the
areas of database interfaces, interactive
processing, and graphics. The information
in these chapters coupled with the
more general information in Chapters 1
through 4 could provide the basis for a
course in modeling and simulation
using SLAM II for, as the author suggests,
college seniors or first-year
graduate students. Certainly junior-level
mathematics, statistics, and computer
programming should be prerequisites
to trying to master the material
contained therein.

The book, however, covers modeling 
and simulation in general. Chapters 1 
through 4, 15, 18, and 19 provide an 
overview of the how and why of simulation.
Even if you have been using or are
contemplating using a simulation language
other than SLAM II, or you want to
acquaint yourself with the general topic
of simulation languages, these chapters
provide a good introduction to the subject,
including why simulation is useful
and how it has been and can be
employed to solve practical problems
and aid in the decision-making process.
The author’s comments about simulation
on the book jacket are also
noteworthy. Of particular interest are
the comments relating the use of networks
to corporate executives. Industrial
engineers and operations researchers
should read the discussions about
materials handling and flexible manufacturing,
topics of current interest to
professionals in these fields. Chapter 16
contains the Material Handling Extension
to SLAM II and Chapter 15
addresses MAP/1, a special-purpose 
simulation language applicable to 
models of manufacturing systems. 

The book is one of the best organized 
texts I have encountered. Each chapter 
contains a table of contents, an introduction,
exercises, and references. This
makes it easy for the reader to
find a topic of interest. A solutions
manual and an education guide are also 
available from the publisher. 

Some of the reference sections are 
extensive. Chapter 4, Applications of 
Simulation, lists 67 references; Chapter 
19, Statistical Aspects of Simulation, 
contains 88 references. 

Because it is well organized, the text 
could be used for self-instruction. This 
is also a good reference book. 

I did find a typo on page 782 (BATH 
versus BATCH), and I would like to 
make a suggestion for the next edition 
of the book. The simulation of computer
systems, which includes hardware
and software interactions in an integrated
fashion, is a topic of interest to
system architects and analysts. While
references to the modeling of computer
systems are made, the topic seems worthy
of more extensive treatment and
would make the book more appealing 
to computer scientists. 


Ralph P. Romanelli 
The BDM Corporation 


IBM's Telecommunications Strategy. 
This 100-page report examines IBM’s 
strategy to achieve market dominance 
in the telecommunications industry.
This strategy will involve implementation
of IBM’s SNA, ISDN, and LEN
expertise. Architecture Technology Corporation,
PO Box 24344, Minneapolis,
MN 55424; (612) 935-2035; $695.

The High-tech Industry Manual.
Hyman Olken’s 114-page book guides
engineers in how to obtain new high-technology
spin-offs from government
research and development programs
and how to build a profitable new business
based on such a technology spin-off.
Illustrated and in case-history format.
Hyman Olken, 2830 Kennedy St.,
Livermore, CA 94550; (415) 447-5177; 
$20. 

Boom in industrial training. The industrial
training market will climb to $14
billion in 1997, up from the 1987 figure
of $6.5 billion, according to a report entitled
Industrial Training Products &
Services Markets (#735) issued by a 
market research/product planning firm. 
International Resource Development, 6 
Prowitt St., Norwalk, CT 06855; (203) 
866-7800; $985. 

Topics in C Programming. Stephen 
Kochan and Patrick Wood’s 400-page 
book discusses how to debug C programs,
how to write terminal-independent
programs, and how to use
“make” for automatic generation of a 
programming system. Other topics 
include structures and pointers, 
dynamic memory allocation, linked 
lists, tree structures, and dispatch 
tables. Howard W. Sams & Company, 
4300 W. 62nd St., Indianapolis, IN 
46268; (317) 298-5400; $24.95. 

Datapro Manufacturing Automation 
Series. This five-volume loose-leaf 
series on CIM includes “Management
and Planning,” “Manufacturing Information
Systems,” “CAD/CAM/CAE
Systems,” “Factory Automation Systems,”
and “News and Perspectives.”
It focuses on the informational needs of
manufacturing directors, engineers, 
architects, and planners. Datapro 
Research, 1221 Avenue of the 
Americas, New York, NY 10020; (800) 
328-2776; $775 (annual subscription). 


New from McGraw-Hill. William E. 
Howden’s Functional Program Testing 
and Analysis uses a scientific programming
and a data processing application
to integrate and unify different testing
methods and techniques. (288 pp., Order
No. 0-07-030550-1, $35.95.)

Software Reliability: Measurement, 
Prediction, Application by J. Musa, A. 
Iannino, and K. Okumoto integrates 
the theoretical and pragmatic aspects of 
software reliability measurement and is 
filled with practical examples. (635 pp., 
Order No. 0-07-044093-X, $47.95.)

McGraw-Hill Book Company, 1221
Avenue of the Americas, New York,
NY 10020; (212) 512-3493.


IBM PS/2 makes clone vendors the 
winners. According to “The IBM Personal
System/2 . . . Implications for
Users,” IBM’s big competitors will be
forced to again address the issue of
IBM emulation for connectivity support,
giving a bigger share of the market
to PC clone vendors whose products
already have such support. The article
appears in Faulkner Report on
Microcomputers and Software.
Faulkner Technical Reports, Inc., 6560
N. Park Dr., Pennsauken, NJ 08109; 
(800) 843-0460. 


Reliability of Computer and Control 
Systems. N. Viswanadham and V. V. S. 
Sarma’s 466-page book presents a state-of-the-art
methodology for the design
of reliable computer control systems. 
Various concepts, tools, and techniques 
from such diverse areas as computer 
science, automatic control, reliability 
theory, and process systems engineering 
are presented. North-Holland, PO Box 
1991, 1000 BZ Amsterdam, The Netherlands;
(020) 5862 911; $250.


Gaging With Vision Systems. This 
280-page, hardcover book contains 26 
articles and technical papers and is 
organized into six sections that include 
an introduction, quality assurance, 
measurement issues, measurement techniques,
and off- and on-line metrology.
Advantages and limitations of current 
systems and future expectations are also 
discussed. Society of Manufacturing 
Engineers, PO Box 930, Dearborn, MI 
48121; (313) 271-1500; $35. 
The Eleventh Annual
International Computer 


Software & Applications Conference 

COMPSAC 87



Takanawa Prince Hotel, Tokyo, Japan 


Tutorials: October 5-6, 1987
Conference: October 7-9, 1987
Technical Tours: October 12-14, 1987


Four Pre-conference Tutorials on Important Subjects 
(October 5-6, 1987)

• Software Quality Assurance: A Practical Approach 

by Tsun S. Chow, AT&T Bell Laboratories 

• Security of Integrated Engineering Information 
by Robert E. Fulton, Georgia Inst. of Technology

• Major Technology Impacts on Software Engineering 
by Morio Nagata, Keio Univ.; Toshiro Ohno, Japan 
BAC; Koichi Furukawa, ICOT; Akira Nagashima, IPA; 
and Ken Sakamura, Univ. of Tokyo 

• Managing Software Projects 

by Donald J. Reifer, Reifer Associates 

Each tutorial runs from 9 a.m. to 5 p.m. 


Conference Topic Areas 
October 7-9, 1987

Each day starts with a plenary address or panel 
discussion by worldwide leaders in computer science, 
technology, and industry. 

Four three-day conference tracks on: 

• Software Design and Management 

• Software Quality and Maintenance 

• Databases and Distributed Processing 

• Language Processing, Expert Systems, and 
Protocols 

in 36 technical and management sessions. 


A Software Industry State-of-the-Art International Conference 


Post-Conference Technical Tours (October 12-14, 1987)

• Toshiba Corporation in Kawasaki city, Kanagawa

• IPA (Information-technology Promotion Agency)
in Akihabara, Tokyo

• NEC Corporation in Abiko city, Chiba

• Electrotechnical Laboratory in the Tsukuba
scientific city, Ibaraki

• Hitachi Corporation in Hatano city, Kanagawa

• Nissan Motor Company in Zama, Kanagawa

• NTT Corporation in Yokosuka city, Kanagawa

• Mitsubishi Electric Corporation in Kamakura city,
Kanagawa

• Fujitsu Ltd. in Numazu city, Shizuoka

• Matsushita Industrial Company, Ltd. in Moriguchi
city, Osaka

• Kyoto University, Dept. of Information Science in
Kyoto city

These are complete tours with bus service to and from the 
Takanawa Prince Hotel each day and lunch with engineers 
and scientists of the companies visited. 


Individuals living outside Asia, 
Australia & New Zealand- 
For additional information and 
registration, contact: 

David H. Jacobsohn 
COMPSAC 87 Registration 
Co-Chair 

5642 South Harper Avenue 
Chicago, IL 60637, USA 
Telephone: 312/752-4562 


Individuals living outside Asia, 
Australia & New Zealand- 
For travel, hotel and additional 
tour arrangements with major 
discount rates, contact the following
company ASAP:
Commerce Tours 
International, Inc. 

870 Market Street, Suite 920 
San Francisco, CA 94102, USA 
Telephone: 415/433-3072 
Fax: 415/788-3280 
Telex: 278124CTI UR 


Individuals living in Asia,
Australia & New Zealand, contact:

Secretariat 
COMPSAC 87 
c/o Business Center for 
Academic Societies Japan 
Yamazaki Bldg. 4F. 
2-40-14, Hongo, Bunkyo-ku 
Tokyo 113, Japan 
Phone: 03-817-5831 
Fax: 03-817-5836 

Telex: 02722268 BCJSP J 


All inquiries & reservations for 
technical tours should be 
directed to: 

Secretariat 
COMPSAC 87 
c/o Business Center for 
Academic Societies Japan 
Yamazaki Bldg. 4F. 
2-40-14, Hongo, Bunkyo-ku 
Tokyo 113, Japan 
Phone: 03-817-5831 
Fax: 03-817-5836 

Telex: 02722268 BCJSP J 
YOU ARE INVITED ...


. . . the only major
Conference/Exhibition
on Artificial
Intelligence on
the East Coast

October 28-30, 1987 

Atlantic City Convention 
Center 

Atlantic City, New Jersey 

The Conference 

Emphasis will be on COMMERCIAL 
APPLICATIONS of Artificial Intelligence 
in business, industry and the professions. 
Recent technological developments will 
be examined, with evaluations of advanced
computer hardware and software
by qualified speakers. 24 Technical Sessions
and 11 Tutorials are planned.


The Exhibition 

Leading suppliers will be displaying their latest 
equipment, systems and technology. Technical 
briefings, product demonstrations and practical 
working data will be made available by the 
exhibitors’ product specialists and senior 
management. 



EAST 








Technical Sessions

AI in Finance/Venture Community;
Funding of AI Companies
AI in Software Design/CAD
AI in Finance/Mainstream Applications
AI Applications Program
Integration of AI Components into
Complete Systems
AI in Manufacturing
AI on Micros
AI on Standard Architecture
LISP on Conventional Hardware
AI on Distributed Processors
Object-Oriented Programming
Intelligent Text Retrieval with AI
Neural Networks
AI in CAE
PROLOG
AI in Mainstream Transaction Processing
Voice Recognition
AI in Medicine
Machine Translation
AI in Defense
AI in IBM Environments
AI in Music
AI on Wall Street
AI in the Federal Government


Eleven Tutorials will offer in-depth presentations
on key subjects.


Who Should Attend? 

Corporate Executives 
Computer Systems Designers 
Office Automation Specialists 
Industrial Engineers 
Government/Military Officials 
Scientists 

Manufacturing Executives 
Academic Researchers 
Computer Specialists 
Banking/Financial Experts 
Industrial R&D Specialists 
Medical Personnel 
Investment Advisors 
Aerospace Scientists 
Office Managers 

... and other buyers of AI products and
services 


Artificial Intelligence 
and Advanced Computer 
Technology 
Conference/Exhibition 

Mail This Coupon for Details 


TO: Tower Conference Management Co., 

331 W. Wesley St., Wheaton, IL 60187 

□ Please send information on ATTENDING 
AI/East '87

□ My company is interested in EXHIBITING. 
Please contact me with details. 


Co-Sponsors: 

ESD : isF 


DM III Data Inc.. 


Title- 

Company- 

Address- 

City_ State- 

Phone-:- Te 


Photocopy coupon for your associates 


Applied Artificial 
Intelligence Reporter 

PC AI

Spang-Robinson Report 

Organized by: 


Tower Conference 
Management Company 

331 W. Wesley St., Wheaton, IL 60187 
(312) 668-8100 Telex 350427 
Fax: (312) 668-8820 
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